Channel: VMware Communities: Message List

ESXi 6.0 hosts become unresponsive


Hi everybody!

 

 

I have a problem with VMware ESXi 6.0.

I have a VMware cluster with 3 ESXi 6.0 hosts. Yesterday evening 2 of the ESXi hosts became unresponsive. The affected hosts respond to ping, but they disconnect from vCenter, cannot be reached directly with the vSphere Client, and are unresponsive on the DCUI. The VMs running on the affected hosts also became unresponsive (VMware HA does not restart the VMs, because the host still holds locks on the VM files). The only workaround is a hard reset of the hosts. After the hard reset, HA restarts the VMs on another host, and the affected host works normally again. The problem occurred during high I/O on the HBAs (file-level backup running inside a VM).

 

 

In /var/log/vmkernel.log I see a lot of messages like these at the time of the "crash":

WARNING: lpfc: lpfc_sli_issue_abort:9956: 1:3169 Abort failed: Abort INP: Data: x0 xcd0 x8 x98

ScsiPath: 7133: Set retry timeout for failed TaskMgmt abort for CmdSN  0x0, status Failure, path vmhba5:C0:T0:L0
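To get a feel for how many of these messages appear and which paths they hit, a small script can tally them. This is only a minimal sketch: the two regular expressions are derived from the two sample lines above, other lpfc/ScsiPath message variants are not covered, and the function name `summarize` is made up for this example.

```python
import re
from collections import Counter

# Patterns are assumptions based only on the two sample log lines above.
ABORT_FAILED = re.compile(r"lpfc_sli_issue_abort:\d+: \d+:\d+ Abort failed")
RETRY_TIMEOUT = re.compile(
    r"failed TaskMgmt abort .* path (vmhba\d+:C\d+:T\d+:L\d+)"
)

def summarize(lines):
    """Count 'Abort failed' warnings and failed TaskMgmt aborts per path."""
    abort_failed = 0
    per_path = Counter()
    for line in lines:
        if ABORT_FAILED.search(line):
            abort_failed += 1
        match = RETRY_TIMEOUT.search(line)
        if match:
            per_path[match.group(1)] += 1
    return abort_failed, per_path

# The two lines quoted above, used here as sample input.
sample = [
    "WARNING: lpfc: lpfc_sli_issue_abort:9956: 1:3169 Abort failed: "
    "Abort INP: Data: x0 xcd0 x8 x98",
    "ScsiPath: 7133: Set retry timeout for failed TaskMgmt abort for "
    "CmdSN  0x0, status Failure, path vmhba5:C0:T0:L0",
]
count, paths = summarize(sample)
print(count, dict(paths))
```

On a real log, the same function could be fed the lines of a copy of vmkernel.log (for example `summarize(open("vmkernel.log"))`) to see whether the failures concentrate on one HBA or path.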

 

 

Host configuration:

Host type: IBM x3850 X5

VMware version: Lenovo Customized ESXi 6.0 + VMware ESXi 6.0 Express Patch 2

FC: 2 * Emulex LightPulse FC SCSI 10.4.236.0, IBM 42D0494 8Gb 2-Port PCIe FC HBA for System x

Emulex firmware version: 2.02X11

Emulex driver version: 10.4.236.0-1OEM.600.0.0.2159203

Host firmware versions are the latest.

ESXi is installed on a USB key (clean install, not upgraded); the log directory is on an FC datastore.

The storage and the FC switches show no error or warning messages.

 

 

I have seen VMware KB 2086025 and 2125904. The symptoms in these KB articles are very similar to our situation, but our hosts have a newer Emulex driver version (the KB articles apply to versions earlier than 10.2.340.18; ours is 10.4.236.0).
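The version comparison can be checked mechanically. The sketch below assumes the Emulex driver versions order numerically field by field, which is my assumption and not something the KB articles state; `parse_version` is a helper name made up for this example.

```python
def parse_version(text):
    """Turn '10.4.236.0-1OEM...' into (10, 4, 236, 0) for comparison."""
    # Keep only the dotted numeric part before the first '-'.
    numeric = text.split("-", 1)[0]
    return tuple(int(part) for part in numeric.split("."))

# Threshold taken from the KB articles: versions earlier than this are affected.
KB_THRESHOLD = parse_version("10.2.340.18")
installed = parse_version("10.4.236.0-1OEM.600.0.0.2159203")

# Tuples compare field by field, so 10.4.x sorts after 10.2.x.
affected = installed < KB_THRESHOLD
print(installed, affected)
```

Under that assumption the installed driver is not in the range the KB articles describe.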

 

I tried the latest Emulex firmware (version 10.6.126.0, installed and then restarted the host), but the host became unresponsive again and the log shows the same messages as before.

 

Today there is a new problem when I collect diagnostic info (Export Logs) from the hosts:

  • first host: disconnected from vCenter for a few seconds 3 times (flapping state) and the log download failed; while the host was disconnected, the VMs running on this host did not respond on the LAN
  • second host: the log download started, and after 10 minutes the host showed a purple screen:

purple_screen.png

 

 

I haven't found any solution.

 

 

Any ideas?

 

 

Thanks for your help!

