0

After completing a routine apt upgrade on my server and laptop and rebooting everything, I found that

$ mount nfs-server:/mnt /mountpoint 

was just hanging forever. There didn't seem to be any logical explanation why.

I first turned to Wireshark for insight:

140 60.439219227 192.168.0.2 192.168.0.3 NFS 170 V4 Reply (Call In 138) EXCHANGE_ID 141 60.439302740 192.168.0.3 192.168.0.2 NFS 258 V4 Call (Reply In 142) CREATE_SESSION 142 60.439984105 192.168.0.2 192.168.0.3 NFS 194 V4 Reply (Call In 141) CREATE_SESSION 143 60.440070415 192.168.0.3 192.168.0.2 NFS 198 V4 Call (Reply In 147) PUTROOTFH | GETATTR 147 65.511499058 192.168.0.2 192.168.0.3 NFS 158 V4 Reply (Call In 143) PUTROOTFH Status: NFS4ERR_DELAY 

Huh. So it's kinda working, but something's not completely happy. What's the holdup?

I next turned to dmesg, which promptly moved me past the "go round in circles looking for clues indeterminately" stage, all at once:

[ 283.998430] rpc.mountd[2238]: segfault at 0 ip 00007f816550f3d6 sp 00007ffd60245820 error 4 in libc-2.28.so[7f81654b7000+148000] [ 283.998523] Code: 1f 44 00 00 85 f6 0f 8e 88 00 00 00 83 fe 01 0f 84 8f 00 00 00 41 54 83 ee 01 49 89 fc 41 b8 01 00 00 00 55 b9 0a 00 00 00 53 <8b> 02 48 89 d3 89 c5 83 e0 df 89 02 48 63 d6 48 89 fe 48 89 df 83 

That occurred almost immediately after startup. I wonder what's going on here...

How do I start this myself? Some digging around in unit files suggests it wants the --manage-gids argument, and squinting at --help mentions -F to run it in the foreground. Ok.

# rpc.mountd --manage-gids -F rpc.mountd: Version 1.3.3 starting Segmentation fault 

Well that didn't take too long to reproduce! What justification does it have for doing that, then?!

stat("/", {st_mode=S_IFDIR|0755, st_size=37, ...}) = 0 openat(AT_FDCWD, "/etc/mtab", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory) futex(0x7faf15dd5168, FUTEX_WAKE_PRIVATE, 2147483647) = 0 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} --- +++ killed by SIGSEGV +++ Segmentation fault 

...oh no, my system hasn't hosed itself, has it? :S

# ls -l /etc/mtab lrwxrwxrwx 1 root root 19 Jul 12 15:17 /etc/mtab -> ../proc/self/mounts 

...OK, whew.

Hmm. I'll bet that rogue openat() is actually coming from a library call. I wonder if ltrace adds any additional interesting context?

__xstat(1, "/", 0x7ffe45d9f620) = 0 free(0) = <void> strdup("/") = 0x559b3adaf640 strlen("/") = 1 setmntent("/etc/mtab", "r") = 0 getmntent(0 <no return ...> --- SIGSEGV (Segmentation fault) --- +++ killed by SIGSEGV +++ 

Hrmmm, what are--

DESCRIPTION These routines are used to access the filesystem description file /etc/fstab and the mounted filesystem description file /etc/mtab. The setmntent() function opens the filesystem description file filename and returns a file pointer which can be used by getmntent(). The argu- ment type is the type of access required and can take the same values as the mode argument of fopen(3). The getmntent() function reads the next line of the filesystem descrip- tion file from stream and returns a pointer to a structure containing the broken out fields from a line in the file. The pointer points to a static area of memory which is overwritten by subsequent calls to getm- ntent(). ... RETURN VALUE The getmntent() and getmntent_r() functions return a pointer to the mn- tent structure or NULL on failure. 

Hang on: setmntent() is returning 0, aka a NULL pointer. And then that NULL pointer is being used in getmntent(). That sounds broken...?

I wonder what's really going on here. Wouldn't it be great if whatever package rpc.mountd is in, had debug symb... oh, awesome, nfs-kernel-server-dbgsym is a thing. Cool.

 Program received signal SIGSEGV, Segmentation fault. 0x00007ffff7db73d6 in __GI___fgets_unlocked (buf=buf@entry=0x5555555a8060 "", n=n@entry=4096, fp=fp@entry=0x0) at iofgets_u.c:50 50 iofgets_u.c: No such file or directory. (gdb) bt #0 0x00007ffff7db73d6 in __GI___fgets_unlocked (buf=buf@entry=0x5555555a8060 "", n=n@entry=4096, fp=fp@entry=0x0) at iofgets_u.c:50 #1 0x00007ffff7e2ef16 in __GI___getmntent_r (stream=stream@entry=0x0, mp=mp@entry=0x7ffff7efb140 , buffer=0x5555555a8060 "", bufsiz=bufsiz@entry=4096) at mntent_r.c:130 #2 0x00007ffff7e2ed03 in getmntent (stream=stream@entry=0x0) at mntent.c:52 #3 0x000055555555cfa8 in next_mnt (v=0x555555571ac8 , p=0x5555555a4ff8 "/") at cache.c:383 #4 nfsd_fh (f=6) at cache.c:736 #5 0x000055555555d2cd in cache_process_req (readfds=readfds@entry=0x7fffffffe960) at cache.c:1424 #6 0x000055555555d718 in my_svc_run () at svc_run.c:117 #7 0x0000555555558ee3 in main (argc=, argv=) at mountd.c:894 

Hmm.

 367 /* Iterate through /etc/mtab, finding mountpoints 368 * at or below a given path 369 */ 370 static char *next_mnt(void **v, char *p) 371 { 372 FILE *f; 373 struct mntent *me; 374 size_t l = strlen(p); 375 if (*v == NULL) { 376 f = setmntent("/etc/mtab", "r"); 377 *v = f; 378 } else 379 f = *v; 380 while ((me = getmntent(f)) != NULL && l > 1 && 381 (strncmp(me->mnt_dir, p, l) != 0 || 382 me->mnt_dir[l] != '/')) 383 ; 384 if (me == NULL) { 385 endmntent(f); 386 *v = NULL; 387 return NULL; 388 } 389 return me->mnt_dir; 390 } 

Hmmmmm.

So... setmntent() is returning NULL, that's being assigned to f, then getmntent(f) is... segfaulting...?

Let's have a look.

$ cat setmntent.c #include <stdio.h> #include <mntent.h> int main() { FILE *ret = setmntent("/etc/mtab", "r"); printf("%p\n", (void *)ret); return 0; } $ gcc -o setmntent setmntent.c $ ./setmntent (nil) 

Yay. So on my system, because /etc/mtab is a symlink, setmntent() asplodes. Great.

And let me guess...

$ cat getmntent.c #include <mntent.h> int main() { getmntent(NULL); } $ gcc -o getmntent getmntent.c $ ./getmntent Segmentation fault 

Well that's thoroughly logical, then. *Headdesk*

I just want NFS working. Wat do?

1
  • First piece of potential interestingness: my laptop, which uses ext4, and also has /etc/mtab symlinked to ../proc/self/mounts, does not exhibit this problem with the setmntent test program. My server is however using ZFS. Something to potentially investigate further. Commented Jul 12, 2021 at 17:24

1 Answer 1

0

Horrible-ish workaround:

rm /etc/mtab; cat /proc/mounts > /etc/mtab

Caveat emptor:

This may randomly break things that expect /etc/mtab to never, ever go away for even a nanosecond

Probably don't use this to fix production without plenty of spare duct tape (time, resources) available


I'm not marking this (self-answer) accepted, because I consider it a hack. Rather, I seek other comments/answers discussing

  • why my system has /etc/mtab pointing at /proc/self/mounts, given that this NFS issue seems 100% reproducible and whould theoretically(?) be blowing up everywhere

  • why this is happening at all

  • what I should do to do canonically fix this situation (particularly with configuring /etc/mtab sanely)

Such info would be highly appreciated!

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.