Recently on a few machines, mount points have started to disappear (one mountpoint per server, at random intervals and on random machines), and I can find nothing in the logs. I have five mountpoints, and any of them may go away at random. There is no relation between the disappearance and the transport protocol of the mount (both TCP and UDP mounts disappear).
What I have not tried:
Running tcpdump continuously (I am reluctant to do so, since this issue happens only once every 2-3 days...)
Info about the machines:
The machines are NFS booted; the boot server is FreeBSD 11.0 (nothing in its logs, by the way). The rootfs options are:
(rw,noatime,vers=3,rsize=131072,wsize=131072,namlen=255,hard,nolock,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=ADDRESS,mountvers=3,mountport=677,mountproto=udp,local_lock=all,addr=ADDRESS)
The OS is CentOS 7, running the 4.11.0-1 ML kernel. Example mount options:
(rw,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=ADDRESS,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=ADDRESS)
(rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=udp,timeo=11,retrans=3,sec=sys,mountaddr=ADDRESS,mountvers=3,mountport=4002,mountproto=udp,local_lock=none,addr=ADDRESS,_netdev)
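To make it easier to see exactly where the TCP and UDP mounts differ, the option strings can be compared mechanically. This is only a minimal sketch: the helper names `parse_mount_options` and `diff_options` are made up for illustration, and the two strings are abbreviated copies of the examples above.

```python
def parse_mount_options(raw: str) -> dict:
    """Turn '(rw,vers=3,...)' into {'rw': True, 'vers': '3', ...}."""
    opts = {}
    for item in raw.strip().strip("()").split(","):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flags without '=' (rw, hard, _netdev) are stored as True.
        opts[key] = value if sep else True
    return opts


def diff_options(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every option that differs."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Abbreviated versions of the two example option strings from this post.
tcp = parse_mount_options(
    "(rw,relatime,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2)"
)
udp = parse_mount_options(
    "(rw,relatime,vers=3,rsize=32768,wsize=32768,hard,proto=udp,timeo=11,retrans=3,_netdev)"
)

for key, (left, right) in diff_options(tcp, udp).items():
    print(f"{key}: tcp={left} udp={right}")
```

An option that is missing on one side shows up as `None`, so a flag such as `_netdev` that is present on only one mount is listed as well.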
Info about the NFS shares and server:
I have five distinct NFS servers in total; load balancing is done over DNS, and all of them export the same mountpoints (a few shares over UDP, a few over TCP). The NFS servers run RHEL 6.7 (Santiago), kernel version 2.6.32-573.el6; snfs-common/server/client is 4.7.3. As you may have guessed, there is nothing in the server logs relating to this problem. Example of export options:
(rw,async,root_squash,no_wdelay,no_subtree_check,fsid=ID)
Things I have tried so far:
My first assumption was that some process calls umount or umount2 on a random NFS share for some bizarre reason. After tracing with sysdig for the unlinkat, unlink, umount and remove system calls, I can only see systemd-logind calling umount2 when a user session is destroyed, but not on the affected mountpoints. The sysdig filter I used in a chisel is posted below:
function on_init()
    local filename = path
    for i in string.gmatch(path, "[^/]+") do
        filename = i
    end
    print("PID\tPROC_NAME\tPROC_EXEC\tPROC_SID\tPROC_PNAME\tPROC_PPID\tPROC_EXELINE\tPROC_PCMDLINE")
    chisel.set_event_formatter("%proc.pid\t%proc.name\t%proc.exec\t%proc.sid\t%proc.pname\t%proc.ppid\t%proc.exeline\t%proc.pcmdline ")
    chisel.set_filter(
        "(evt.type=unlinkat and evt.arg.name=" .. path .. ") or \
        (evt.type=unlink and evt.arg.path=" .. path .. ") or \
        (evt.type=umount) or \
        (evt.type=remove and evt.arg.path=" .. path .. ")")
    return true
end
The unmount happened again at random, but the filter did not catch it. Thinking the filter was flawed, I wrote a program that unmounts a share with both umount and umount2 (I tried both the lazy and the force flags), and the filter detected those calls correctly. This leads me to believe that the kernel itself is unmounting things.
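For anyone who wants to repeat the filter check, a test program of that kind can be sketched in a few lines. This is not the exact program I used, just a minimal Python equivalent that calls umount2(2) through libc with ctypes; it assumes Linux, needs root to actually unmount anything, and the wrapper name `umount2` is simply chosen here to mirror the system call.

```python
import ctypes
import os

# Flag values from <sys/mount.h> on Linux.
MNT_FORCE = 1   # force the unmount even if the filesystem is busy
MNT_DETACH = 2  # lazy unmount: detach now, clean up once no longer busy


def umount2(target: str, flags: int = 0) -> None:
    """Call umount2(2) through libc and raise OSError if it fails."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.umount2(os.fsencode(target), flags) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)


# Example calls (as root, against a test share only):
#   umount2("/mountpoint", MNT_DETACH)   # lazy unmount
#   umount2("/mountpoint", MNT_FORCE)    # forced unmount
```

If the sysdig filter is working, each of these calls should show up as a umount event attributed to the Python process.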
I have nothing in my logs, not even the usual "nfs not responding" message that appears when there is a problem with the share.
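Since the logs are silent, one thing that could at least pin down the exact time of a disappearance is a small watcher on the mount table. The following is a minimal sketch, assuming Linux, where /proc/self/mounts can be polled for changes; the function names `nfs_mounts` and `watch` and the output format are invented for illustration.

```python
import select
import time


def nfs_mounts(mounts_text: str) -> set:
    """Return the mount points of all NFS entries in /proc/self/mounts text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.add(fields[1])
    return points


def watch(path: str = "/proc/self/mounts") -> None:
    """Block on mount-table changes and print NFS mounts that vanish or appear."""
    with open(path) as f:
        known = nfs_mounts(f.read())
        poller = select.poll()
        # The kernel signals a changed mount table as a priority/error event.
        poller.register(f, select.POLLERR | select.POLLPRI)
        while True:
            poller.poll()  # sleeps until something is mounted or unmounted
            f.seek(0)
            current = nfs_mounts(f.read())
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            for gone in sorted(known - current):
                print(f"{stamp} DISAPPEARED {gone}", flush=True)
            for new in sorted(current - known):
                print(f"{stamp} APPEARED {new}", flush=True)
            known = current
```

Calling `watch()` from a long-running process and redirecting its output to a file would give a timestamp to correlate with the server logs, or with a short packet capture, without having to record traffic for days.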
If I log in on a machine and remount, the remount succeeds without any problem.
I have numerous clients running the same setup, and this does not happen there. The only things this group of machines has in common are the network segment and the NFS boot server. But I fail to see why absolutely nothing would be reported if communication between the server and the client had died.
/etc/fstab entry:
...server:nfs_share /mountpoint nfs rw,hard,intr,nfsvers=3 0 0