How do I get out of this situation safely?
Details are as follows:
A Xen server has block devices allocated to VMs, but these devices have also been mounted inside dom0 on the Xen host.
In fact, 44 of these block devices have been mounted like this. To make matters worse, each physical device is seen over 4 paths, and each of those paths is mounted on a separate mountpoint. In other words, the devices are actually mounted 5 times each (four times in dom0 plus once in the guest).
The VM guest OS sees the path via a PowerPath pseudo device (allocated as a phy: block device to the domU).
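To show how I counted the duplicate mounts, here is a minimal sketch that groups mountpoints by source device. It assumes the input is text in the /proc/mounts format (device, mountpoint, type, options, ...). The function name `mounts_by_device` is made up for this illustration.

```python
from collections import defaultdict


def mounts_by_device(mounts_text):
    """Group mountpoints by source device from /proc/mounts-style text.

    Only entries whose source starts with /dev/ are kept, so pseudo
    file systems such as proc and sysfs are ignored.
    """
    grouped = defaultdict(list)
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip blank or malformed lines and non-device sources.
        if len(fields) < 2 or not fields[0].startswith("/dev/"):
            continue
        grouped[fields[0]].append(fields[1])
    return dict(grouped)


def read_dom0_mounts(path="/proc/mounts"):
    """Convenience wrapper: read the live mount table (Linux only)."""
    with open(path) as fh:
        return mounts_by_device(fh.read())
```

Calling `read_dom0_mounts()` in dom0 returns a dictionary mapping each /dev/... source to its list of mountpoints. Mapping the individual sd* paths back to their emcpower pseudo device still has to be done separately with the PowerPath tools.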
Some of the devices are formatted as ext2 and reiserfs.
No need to explain to me the file system corruption risks involved here.
I am afraid that even just unmounting the file systems may cause corruption, and I feel that at this point pulling the power from the host is the safest option.
Note that the applications, Oracle databases for the most part, in all the VMs are still running and in use.
I discovered this when investigating high CPU usage on the dom0. There is an unkillable "find" process with cwd -> /media/disk-12, which is mounted from /dev/sdf1, which belongs to /dev/emcpowerr.
Before anybody asks: the one situation in which I have seen processes that cannot be killed and that continue to use CPU and RAM (unlike a defunct/zombie process) is when there are outstanding committed I/Os, e.g. sync has returned but the data is not physically on disk yet. More commonly this occurs with tape I/O.
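For anyone who wants to check the process state themselves, here is a minimal sketch of reading the state field from a /proc/<pid>/stat line. Per proc(5), the state is a single letter following the parenthesised command name, with D meaning uninterruptible sleep, Z zombie and R running. The function name `process_state` is made up for this illustration.

```python
def process_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name (second field) is wrapped in parentheses and may
    itself contain spaces or ')' characters, so the line is split on
    the LAST ')' rather than on whitespace alone.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_process_state(pid):
    """Convenience wrapper: read the state of a live process (Linux only)."""
    with open("/proc/%d/stat" % pid) as fh:
        return process_state(fh.read())
```

A result of "D" would be consistent with a process stuck waiting on I/O, whereas "Z" would indicate an ordinary zombie.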
Suggestions!?
P.S. I would have expected devices to be "reserved" once mounted, to prevent this kind of thing? Or is that not possible on Linux?
EDIT: Firstly, I am convinced that KDE (within the hypervisor) is the culprit. It looks like KDE is mounting every device it can at login in order to create desktop icons. The same thing is, however, not happening on other Xen servers, but all the other servers are running a much older version of SLES and KDE (V4 appears to be the offending one, with 3.4 behaving better).
Furthermore two non-critical VMs have become hung. After shutting them down they would not boot up again due to file system corruption. The main/production VM is still running and the database on it still working, but clearly this is a time bomb. The customer is attempting to re-build the environment on another VM on another server but is stuck on issues configuring some of the components, so we are waiting...
In any case, I feel that none of the answers so far have gone beyond "best practice is always to shut down gracefully", and I hope to get something more concrete. This situation may warrant some more careful thinking. Will shutting down cause outstanding I/O, in particular file system metadata updates from the hypervisor, to be synced and so cause potentially major file system corruption?