Hi everybody!
I have a problem with VMware ESXi 6.0.
I have a VMware cluster with three ESXi 6.0 hosts. Yesterday evening two of the hosts became unresponsive. The affected hosts still respond to ping, but they disconnect from vCenter, cannot be reached directly with the vSphere Client, and are unresponsive on the DCUI. The VMs running on the affected hosts also became unresponsive (VMware HA does not restart them, because the host still holds locks on the VM files). The only workaround is a hard reset of the hosts. After the hard reset, HA restarts the VMs on another host and the affected host works normally again. The problem occurred during high I/O on the HBAs (file-level backup running inside a VM).
In /var/log/vmkernel.log I see a lot of messages like these around the time of the "crash":
WARNING: lpfc: lpfc_sli_issue_abort:9956: 1:3169 Abort failed: Abort INP: Data: x0 xcd0 x8 x98
ScsiPath: 7133: Set retry timeout for failed TaskMgmt abort for CmdSN 0x0, status Failure, path vmhba5:C0:T0:L0
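To see which paths are affected and how often, the two message types above can be tallied with a short script. This is only a minimal sketch: the regular expressions are assumptions derived from the two lines quoted above, and `summarize` is a hypothetical helper name, not part of any VMware tool.

```python
import re
from collections import Counter

# Patterns are assumptions based on the two log lines quoted above.
# Real vmkernel.log lines carry a timestamp prefix, so search() is used
# instead of match().
ABORT_FAILED = re.compile(r"lpfc_sli_issue_abort:\d+: \d+:\d+ Abort failed")
TASKMGMT_PATH = re.compile(
    r"failed TaskMgmt abort .* path (vmhba\d+:C\d+:T\d+:L\d+)"
)


def summarize(lines):
    """Return (number of lpfc abort failures, Counter of affected paths)."""
    aborts = 0
    paths = Counter()
    for line in lines:
        if ABORT_FAILED.search(line):
            aborts += 1
        match = TASKMGMT_PATH.search(line)
        if match:
            paths[match.group(1)] += 1
    return aborts, paths


if __name__ == "__main__":
    # Sample input: the two lines from this post.
    sample = [
        "WARNING: lpfc: lpfc_sli_issue_abort:9956: 1:3169 Abort failed: "
        "Abort INP: Data: x0 xcd0 x8 x98",
        "ScsiPath: 7133: Set retry timeout for failed TaskMgmt abort for "
        "CmdSN 0x0, status Failure, path vmhba5:C0:T0:L0",
    ]
    aborts, paths = summarize(sample)
    print("lpfc abort failures:", aborts)
    for path, count in paths.most_common():
        print(path, count)
```

Running it against a copy of the full log (for example `summarize(open("vmkernel.log"))`) would show whether the failures are concentrated on one HBA or spread across all paths.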
Host configuration:
Host type: IBM x3850 X5
VMware version: Lenovo Customized ESXi 6.0 + VMware ESXi 6.0 Express Patch 2
FC: 2 * Emulex LightPulse FC SCSI 10.4.236.0 IBM 42D0494 8Gb 2-Port PCIe FC HBA for System x
Emulex firmware version: 2.02X11
Emulex driver version: 10.4.236.0-1OEM.600.0.0.2159203
The host firmware is at the latest version.
ESXi is installed on a USB key (clean install, not upgraded); the log directory is on an FC datastore.
There are no error or warning messages on the storage or FC switch side.
I have read VMware KB 2086025 and 2125904. The symptoms described in these KB articles are very similar to our situation, but our hosts have a newer Emulex driver version (the KB articles apply to versions earlier than 10.2.340.18; ours is 10.4.236.0).
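For clarity, this is the comparison I am making between the driver version and the KB threshold. It is a minimal sketch: `parse_version` and `is_affected` are hypothetical helper names, and it assumes the part before the first `-` in the driver string is a plain dotted numeric version.

```python
def parse_version(text):
    """Turn '10.4.236.0-1OEM.600.0.0.2159203' into (10, 4, 236, 0)."""
    dotted = text.split("-", 1)[0]  # drop the OEM/build suffix, if any
    return tuple(int(part) for part in dotted.split("."))


def is_affected(installed, fixed_in):
    """True if the installed version is earlier than the fixed version."""
    return parse_version(installed) < parse_version(fixed_in)


if __name__ == "__main__":
    installed = "10.4.236.0-1OEM.600.0.0.2159203"  # driver version from this post
    fixed_in = "10.2.340.18"                       # threshold named in the KB articles
    print(is_affected(installed, fixed_in))
```

Tuples compare element by element, so 10.4.x sorts after 10.2.x regardless of the later fields, which is why I conclude the KB articles should not apply to our driver.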
I tried the latest Emulex firmware (version 10.6.126.0, installed and host restarted), but the host became unresponsive again and the log shows the same messages as before.
Today a new problem appeared when I collected diagnostic information (Export Logs) from the hosts:
- first host: it disconnected from vCenter for a few seconds three times (flapping state) and the log download failed; while the host was disconnected, the VMs running on it did not respond on the LAN
- second host: the log download started, and after 10 minutes the host crashed with a purple screen
I have not found a solution.
Any ideas?
Thanks for your help!