I suspect some kind of memory corruption. The problems occur mainly in processes that allocate large amounts of RAM (especially browsers and cc1plus). JVMs with intensive RAM usage die almost at startup; Eclipse can't even create its first window. The failure is nearly always a segfault caused by a null pointer dereference.
What I have observed:
- Windows and OS X guests running the same processes under ESXi don't show the problem.
- Linux guests do show it, with different kernels (between 3.6 and 4.1) and with different distributions (I've tried Debian, Ubuntu and openSUSE).
- 32-bit guests aren't affected, even if they run Linux.
- PCI passthrough is enabled, but the problem also occurs without it (and even with the IOMMU turned off in the BIOS).
- Upgrading ESXi from 5.5 to 6 didn't help.
Other info:
- I ran memtest overnight on the host machine, and it didn't find any problems.
- With a 64-bit OS running directly on the host (that is, without virtualization), the problem doesn't occur.
- The crashes happen mainly after the process has allocated a large amount of memory.
- The problem nearly always shows up in user space (though that may simply be because kernel space only rarely allocates large amounts of RAM).
- Turning off acceleration, or switching the guest OS type to "other linux" or "other 64-bit OS", didn't help.
- Turning off SMP on the guest machines (i.e. giving them only a single CPU core) didn't help. Playing with other CPU settings (various multi-socket / multi-core settings, limiting the guest to various CPU cores) didn't have any effect.
- Limiting the guest OS memory didn't help either, although guests running only a few processes with low memory consumption work seamlessly.
- Right after power-on the problem doesn't occur. It appears only after some large memory allocate/free cycles have happened (for example, only after a few minutes of clicking around in Firefox).
- Changing the virtual SCSI controller type from LSI to VMware Paravirtual, or turning off swap, didn't help.
I think something corrupts (zeroes) the guest OS's memory on large brk() calls.
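One way to check the zeroing hypothesis from user space would be a simple pattern test: allocate large blocks, fill them with a non-zero byte, read them back, free them, and repeat. Below is a minimal sketch in Python. The function names (`first_bad_offset`, `stress`), the 0xA5 pattern and the sizes are all made up for illustration. Note also that CPython may obtain buffers this large via mmap() rather than brk(), so this exercises the large allocate/free cycle in general, not brk() specifically.

```python
import sys

PATTERN = 0xA5  # arbitrary non-zero byte, so zeroed memory stands out


def first_bad_offset(buf, pattern=PATTERN, chunk=1 << 20):
    """Return the offset of the first byte != pattern, or -1 if all match."""
    good = bytes([pattern]) * chunk
    for start in range(0, len(buf), chunk):
        piece = buf[start:start + chunk]
        if piece != good[:len(piece)]:
            # Slow path: locate the exact byte inside the mismatching chunk.
            for i, b in enumerate(piece):
                if b != pattern:
                    return start + i
    return -1


def stress(cycles=10, blocks=4, size_mib=64, pattern=PATTERN):
    """Run allocate/fill/verify/free cycles.

    Returns None if every block read back intact, otherwise a tuple
    (cycle, block_index, offset) describing the first mismatch found.
    """
    size = size_mib * 1024 * 1024
    for cycle in range(cycles):
        # Allocate all blocks first, so they coexist before verification.
        bufs = [bytearray([pattern]) * size for _ in range(blocks)]
        for idx, buf in enumerate(bufs):
            bad = first_bad_offset(buf, pattern)
            if bad != -1:
                return (cycle, idx, bad)
        del bufs  # drop the blocks so the next cycle allocates afresh
    return None


if __name__ == "__main__":
    result = stress()
    if result is None:
        print("no corruption detected")
    else:
        print("mismatch at (cycle, block, offset):", result)
        sys.exit(1)
```

If this ever reports a mismatch inside an affected guest but never on bare metal or in a 32-bit guest, that would support the corruption theory; a clean run proves little, since the sketch only checks memory it has just written.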
Has anybody run into this problem? What else could be done to make the hunt more efficient?