After completing a routine apt upgrade on my server and laptop and rebooting everything, I found that
$ mount nfs-server:/mnt /mountpoint was just hanging forever. There didn't seem to be any logical explanation why.
I first turned to Wireshark for insight:
140 60.439219227 192.168.0.2 192.168.0.3 NFS 170 V4 Reply (Call In 138) EXCHANGE_ID 141 60.439302740 192.168.0.3 192.168.0.2 NFS 258 V4 Call (Reply In 142) CREATE_SESSION 142 60.439984105 192.168.0.2 192.168.0.3 NFS 194 V4 Reply (Call In 141) CREATE_SESSION 143 60.440070415 192.168.0.3 192.168.0.2 NFS 198 V4 Call (Reply In 147) PUTROOTFH | GETATTR 147 65.511499058 192.168.0.2 192.168.0.3 NFS 158 V4 Reply (Call In 143) PUTROOTFH Status: NFS4ERR_DELAY Huh. So it's kinda working, but something's not completely happy. What's the holdup?
I next turned to dmesg, which promptly moved me past the "go round in circles looking for clues indeterminately" stage, all at once:
[ 283.998430] rpc.mountd[2238]: segfault at 0 ip 00007f816550f3d6 sp 00007ffd60245820 error 4 in libc-2.28.so[7f81654b7000+148000] [ 283.998523] Code: 1f 44 00 00 85 f6 0f 8e 88 00 00 00 83 fe 01 0f 84 8f 00 00 00 41 54 83 ee 01 49 89 fc 41 b8 01 00 00 00 55 b9 0a 00 00 00 53 <8b> 02 48 89 d3 89 c5 83 e0 df 89 02 48 63 d6 48 89 fe 48 89 df 83 That occurred almost immediately after startup. I wonder what's going on here...
How do I start this myself? Some digging around in unit files suggests it wants the --manage-gids argument, and squinting at --help mentions -F to run it in the foreground. Ok.
# rpc.mountd --manage-gids -F rpc.mountd: Version 1.3.3 starting Segmentation fault Well that didn't take too long to reproduce! What justification does it have for doing that, then?!
stat("/", {st_mode=S_IFDIR|0755, st_size=37, ...}) = 0 openat(AT_FDCWD, "/etc/mtab", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) futex(0x7faf15dd5168, FUTEX_WAKE_PRIVATE, 2147483647) = 0 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} --- +++ killed by SIGSEGV +++ Segmentation fault ...oh no, my system hasn't hosed itself, has it? :S
# ls -l /etc/mtab lrwxrwxrwx 1 root root 19 Jul 12 15:17 /etc/mtab -> ../proc/self/mounts ...OK, whew.
Hmm. I'll bet that rogue openat() is actually coming from a library call. I wonder if ltrace adds any additional interesting context?
__xstat(1, "/", 0x7ffe45d9f620) = 0 free(0) = <void> strdup("/") = 0x559b3adaf640 strlen("/") = 1 setmntent("/etc/mtab", "r") = 0 getmntent(0 <no return ...> --- SIGSEGV (Segmentation fault) --- +++ killed by SIGSEGV +++ Hrmmm, what are--
DESCRIPTION These routines are used to access the filesystem description file /etc/fstab and the mounted filesystem description file /etc/mtab. The setmntent() function opens the filesystem description file filename and returns a file pointer which can be used by getmntent(). The argu- ment type is the type of access required and can take the same values as the mode argument of fopen(3). The getmntent() function reads the next line of the filesystem descrip- tion file from stream and returns a pointer to a structure containing the broken out fields from a line in the file. The pointer points to a static area of memory which is overwritten by subsequent calls to getm- ntent(). ... RETURN VALUE The getmntent() and getmntent_r() functions return a pointer to the mn- tent structure or NULL on failure. Hang on: setmntent() is returning 0, aka a NULL pointer. And then that NULL pointer is being used in getmntent(). That sounds broken...?
I wonder what's really going on here. Wouldn't it be great if whatever package rpc.mountd is in, had debug symb... oh, awesome, nfs-kernel-server-dbgsym is a thing. Cool.
Program received signal SIGSEGV, Segmentation fault. 0x00007ffff7db73d6 in __GI___fgets_unlocked (buf=buf@entry=0x5555555a8060 "", n=n@entry=4096, fp=fp@entry=0x0) at iofgets_u.c:50 50 iofgets_u.c: No such file or directory. (gdb) bt #0 0x00007ffff7db73d6 in __GI___fgets_unlocked (buf=buf@entry=0x5555555a8060 "", n=n@entry=4096, fp=fp@entry=0x0) at iofgets_u.c:50 #1 0x00007ffff7e2ef16 in __GI___getmntent_r (stream=stream@entry=0x0, mp=mp@entry=0x7ffff7efb140 , buffer=0x5555555a8060 "", bufsiz=bufsiz@entry=4096) at mntent_r.c:130 #2 0x00007ffff7e2ed03 in getmntent (stream=stream@entry=0x0) at mntent.c:52 #3 0x000055555555cfa8 in next_mnt (v=0x555555571ac8 , p=0x5555555a4ff8 "/") at cache.c:383 #4 nfsd_fh (f=6) at cache.c:736 #5 0x000055555555d2cd in cache_process_req (readfds=readfds@entry=0x7fffffffe960) at cache.c:1424 #6 0x000055555555d718 in my_svc_run () at svc_run.c:117 #7 0x0000555555558ee3 in main (argc=, argv=) at mountd.c:894
Hmm.
367 /* Iterate through /etc/mtab, finding mountpoints 368 * at or below a given path 369 */ 370 static char *next_mnt(void **v, char *p) 371 { 372 FILE *f; 373 struct mntent *me; 374 size_t l = strlen(p); 375 if (*v == NULL) { 376 f = setmntent("/etc/mtab", "r"); 377 *v = f; 378 } else 379 f = *v; 380 while ((me = getmntent(f)) != NULL && l > 1 && 381 (strncmp(me->mnt_dir, p, l) != 0 || 382 me->mnt_dir[l] != '/')) 383 ; 384 if (me == NULL) { 385 endmntent(f); 386 *v = NULL; 387 return NULL; 388 } 389 return me->mnt_dir; 390 } Hmmmmm.
So... setmntent() is returning NULL, that's being assigned to f, then getmntent(f) is... segfaulting...?
Let's have a look.
$ cat setmntent.c #include <stdio.h> #include <mntent.h> int main() { FILE *ret = setmntent("/etc/mtab", "r"); printf("%p\n", (void *)ret); return 0; } $ gcc -o setmntent setmntent.c $ ./setmntent (nil) Yay. So on my system, because /etc/mtab is a symlink, setmntent() asplodes. Great.
And let me guess...
$ cat getmntent.c #include <mntent.h> int main() { getmntent(NULL); } $ gcc -o getmntent getmntent.c $ ./getmntent Segmentation fault Well that's thoroughly logical, then. *Headdesk*
/etc/mtabsymlinked to../proc/self/mounts, does not exhibit this problem with thesetmntenttest program. My server is however using ZFS. Something to potentially investigate further.