I have 19 CentOS Linux virtual servers for a specific solution. These servers are distributed in three network areas, C, D, S.
The NFS server is located in zone S and is configured to expose a share that the other servers in all three network zones can mount. Firewall and routing rules between C <-> S and D <-> S are established and working. The NFS share has been confirmed to work in the past.
The NFS settings are configured on one specific server of the solution, which distributes them to all other servers of the solution. For example, after setup it mounts the share automatically on all of them and adjusts the /etc/fstab file on all 19 CentOS servers.
The problem is that the mounted share intermittently disappears on some of the 19 servers. I don't know why.
When that happens, I am not able to remount it manually with the mount command. Additionally, df -h stops responding, cd into the mount point hangs the SSH session, and tcpdump shows packets with checksum errors when I try to mount the share manually again.
A reboot of the server clears the problem and the share is automatically mounted again.
I would like to make the NFS configuration more resilient.
What I found out so far:
- df -h hangs, but can be terminated via CTRL+C
- cd into mount point freezes SSH session
- umount -f MOUNTPOINT returns busy device
- umount -l MOUNTPOINT works
- manual mount via mount -t nfs IP:SHARE MOUNTPOINT doesn't work; it runs indefinitely
- mount | grep nfs > only sunrpc on ...
- nfsstat -m > returned nothing
- uname -r > 3.10.0-1160.95.1.el7.x86_64
- ps aux | grep " D " > root PID 0.0 0.0 0 0 ? D Aug13 0:00 [NFS_SERVER_IP-man]
- Vendor doesn't know what's going on
- cat /etc/fstab > NFS_IP:SHARE MOUNTPOINT nfs defaults 0 0
- ps -e | grep nfs > nfsiod and nfsv4.1-svc
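Since df -h and cd hang on the affected servers, a time-limited probe may be a safer way to check the mount. The following is a minimal sketch, assuming GNU coreutils (timeout, stat) and procps ps are available; check_mount and list_blocked are hypothetical helper names, and the 5-second default is an arbitrary choice.

```shell
#!/bin/sh
# check_mount PATH [SECONDS]
# Hypothetical helper: returns 0 if stat on PATH answers within the limit,
# non-zero otherwise. KILL is sent on expiry; a probe that is truly stuck
# in uninterruptible sleep may still not exit.
check_mount() {
    cm_path=$1
    cm_limit=${2:-5}
    timeout -s KILL "$cm_limit" stat "$cm_path" >/dev/null 2>&1
}

# list_blocked
# Hypothetical helper: prints the ps header plus every process whose state
# starts with D (uninterruptible sleep), including the kernel wait channel.
list_blocked() {
    ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /^D/'
}

# Example use (MOUNTPOINT is the placeholder from the fstab line above):
#   check_mount MOUNTPOINT 5 || echo "mount point not responding"
#   list_blocked
```

The wchan column printed by list_blocked may be useful to pass on to the vendor, as it shows where in the kernel a blocked thread is waiting.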
As far as I understand it, the kernel thread is stuck in uninterruptible sleep (state D) and cannot be recovered on this old OS without a reboot. In addition, network issues apparently kept the NFS connection from being re-established automatically, which might be changed with specific mount options in /etc/fstab (e.g. soft).
I would appreciate any help in solving the problem, as well as technical hints that I can use to communicate the problem internally and to the vendor.
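One way to compare mount options before editing /etc/fstab might be to try them on a manual mount first. The sketch below is only an illustration, not a recommendation: try_nfs_mount is a hypothetical helper, the RUN variable is an assumption of this sketch, and the values timeo=600 and retrans=2 are example numbers. The options hard, soft, timeo (in tenths of a second), retrans and bg are described in nfs(5).

```shell
#!/bin/sh
# try_nfs_mount OPTIONS SOURCE TARGET
# Hypothetical helper: prints the mount command for review by default and
# only runs it (as root) when RUN=1 is set in the environment.
try_nfs_mount() {
    tm_opts=$1
    tm_src=$2
    tm_dst=$3
    if [ "${RUN:-0}" = 1 ]; then
        mount -t nfs -o "$tm_opts" "$tm_src" "$tm_dst"
    else
        printf '%s\n' "mount -t nfs -o $tm_opts $tm_src $tm_dst"
    fi
}

# Example use with the placeholders from the fstab line:
#   try_nfs_mount hard,timeo=600,retrans=2,bg IP:SHARE MOUNTPOINT
#   try_nfs_mount soft,timeo=600,retrans=2,bg IP:SHARE MOUNTPOINT
```

The same option string would go in the fourth field of the /etc/fstab entry in place of defaults. Whether soft is appropriate depends on the workload, because a soft mount returns I/O errors to applications once the retries are exhausted.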
soft option. And if you're running executables, or any of the apps use mmap() to read data from the NFS filesystem, IO failures will cause corrupted reads.