I have a 3-node Patroni 2.1.1 + PostgreSQL 13.4 cluster, and one of the nodes acts strangely once it finishes the base-backup bootstrap. Another identical node works fine when put through the same process.

Basically, once Patroni tries to connect to localhost:5432, the server seems to die and be reforked (from my understanding). You'll see the "loop" below, starting at connection received: host=::1 port=XXXXX.

2021-11-01 12:01:43.395 EDT [242925] LOG: starting PostgreSQL 13.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.4.1 20200928 (Red Hat 8.4.1-1), 64-bit
2021-11-01 12:01:43.396 EDT [242925] LOG: listening on IPv6 address "::1", port 5432
2021-11-01 12:01:43.396 EDT [242925] LOG: listening on IPv4 address "127.0.0.1", port 5432
2021-11-01 12:01:43.397 EDT [242925] LOG: listening on IPv4 address "192.168.23.13", port 5432
2021-11-01 12:01:43.398 EDT [242925] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-11-01 12:01:43.401 EDT [242925] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2021-11-01 12:01:43.405 EDT [242928] LOG: database system was interrupted; last known up at 2021-11-01 12:01:25 EDT
2021-11-01 12:01:43.405 EDT [242928] DEBUG: removing all temporary WAL segments
2021-11-01 12:01:43.436 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/00000077.history pg_wal/RECOVERYHISTORY"
cp: cannot stat '/var/lib/pgsql/13/archive/00000077.history': No such file or directory
2021-11-01 12:01:43.440 EDT [242928] DEBUG: could not restore file "00000077.history" from archive: child process exited with exit code 1
2021-11-01 12:01:43.441 EDT [242928] LOG: entering standby mode
2021-11-01 12:01:43.441 EDT [242928] DEBUG: backup time 2021-11-01 12:01:25 EDT in file "backup_label"
2021-11-01 12:01:43.441 EDT [242928] DEBUG: backup label pg_basebackup base backup in file "backup_label"
2021-11-01 12:01:43.441 EDT [242928] DEBUG: backup timeline 118 in file "backup_label"
2021-11-01 12:01:43.441 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/00000076.history pg_wal/RECOVERYHISTORY"
2021-11-01 12:01:43.538 EDT [242928] LOG: restored log file "00000076.history" from archive
2021-11-01 12:01:43.540 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/000000760000000100000000 pg_wal/RECOVERYXLOG"
2021-11-01 12:01:43.689 EDT [242928] LOG: restored log file "000000760000000100000000" from archive
2021-11-01 12:01:43.692 EDT [242928] DEBUG: got WAL segment from archive
2021-11-01 12:01:43.693 EDT [242928] DEBUG: checkpoint record is at 1/60
2021-11-01 12:01:43.693 EDT [242928] DEBUG: redo record is at 1/28; shutdown false
2021-11-01 12:01:43.693 EDT [242928] DEBUG: next transaction ID: 531; next OID: 24606
2021-11-01 12:01:43.693 EDT [242928] DEBUG: next MultiXactId: 1; next MultiXactOffset: 0
2021-11-01 12:01:43.693 EDT [242928] DEBUG: oldest unfrozen transaction ID: 478, in database 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: oldest MultiXactId: 1, in database 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: commit timestamp Xid oldest/newest: 0/0
2021-11-01 12:01:43.693 EDT [242928] DEBUG: transaction ID wrap limit is 2147484125, limited by database with OID 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2021-11-01 12:01:43.693 EDT [242928] DEBUG: starting up replication slots
2021-11-01 12:01:43.693 EDT [242928] DEBUG: starting up replication origin progress state
2021-11-01 12:01:43.697 EDT [242928] DEBUG: resetting unlogged relations: cleanup 1 init 0
2021-11-01 12:01:43.698 EDT [242928] DEBUG: initializing for hot standby
2021-11-01 12:01:43.698 EDT [242928] DEBUG: my backend ID is 1
2021-11-01 12:01:43.698 EDT [242928] LOG: redo starts at 1/28
2021-11-01 12:01:43.698 EDT [242928] DEBUG: prune KnownAssignedXids to 531
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/28 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: 0 KnownAssignedXids (num=0 tail=0 head=0)
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/28 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: recovery snapshots are now enabled
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/28 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: prune KnownAssignedXids to 531
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/D8 for Standby/RUNNING_XACTS: nextXid 531 latestCompletedXid 530 oldestRunningXid 531
2021-11-01 12:01:43.698 EDT [242928] DEBUG: end of backup reached
2021-11-01 12:01:43.698 EDT [242928] CONTEXT: WAL redo at 1/110 for XLOG/BACKUP_END: 1/28
2021-11-01 12:01:43.699 EDT [242932] DEBUG: checkpointer updated shared memory configuration values
2021-11-01 12:01:43.700 EDT [242928] LOG: consistent recovery state reached at 1/138
2021-11-01 12:01:43.700 EDT [242928] DEBUG: executing restore command "cp /var/lib/pgsql/13/archive/000000760000000100000001 pg_wal/RECOVERYXLOG"
2021-11-01 12:01:43.700 EDT [242925] LOG: database system is ready to accept read only connections
cp: cannot stat '/var/lib/pgsql/13/archive/000000760000000100000001': No such file or directory
2021-11-01 12:01:43.704 EDT [242928] DEBUG: could not restore file "000000760000000100000001" from archive: child process exited with exit code 1
2021-11-01 12:01:43.704 EDT [242928] DEBUG: could not open file "pg_wal/000000760000000100000001": No such file or directory
2021-11-01 12:01:43.704 EDT [242928] DEBUG: switched WAL source from archive to stream after failure
2021-11-01 12:01:43.705 EDT [242936] DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-13/lib/libpqwalreceiver"
2021-11-01 12:01:43.706 EDT [242936] DEBUG: find_in_dynamic_libpath: trying "/usr/pgsql-13/lib/libpqwalreceiver.so"
2021-11-01 12:01:44.370 EDT [242925] DEBUG: forked new backend, pid=242938 socket=12
2021-11-01 12:01:44.371 EDT [242938] LOG: connection received: host=::1 port=36714
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: exit(0)
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:51.235 EDT [242938] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:01:51.236 EDT [242925] DEBUG: forked new backend, pid=242941 socket=12
2021-11-01 12:01:51.237 EDT [242941] LOG: connection received: host=127.0.0.1 port=43428
2021-11-01 12:01:51.237 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:01:51.237 EDT [242925] DEBUG: server process (PID 242938) exited with exit code 0
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: exit(0)
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:01:58.284 EDT [242941] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:01:58.285 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:01:58.285 EDT [242925] DEBUG: server process (PID 242941) exited with exit code 0
2021-11-01 12:01:59.295 EDT [242925] DEBUG: forked new backend, pid=242944 socket=12
2021-11-01 12:01:59.295 EDT [242944] LOG: connection received: host=::1 port=36758
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: exit(0)
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:06.301 EDT [242944] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:06.301 EDT [242925] DEBUG: forked new backend, pid=242948 socket=12
2021-11-01 12:02:06.302 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:06.302 EDT [242925] DEBUG: server process (PID 242944) exited with exit code 0
2021-11-01 12:02:06.302 EDT [242948] LOG: connection received: host=127.0.0.1 port=43470
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: exit(0)
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:13.440 EDT [242948] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:13.441 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:13.441 EDT [242925] DEBUG: server process (PID 242948) exited with exit code 0
2021-11-01 12:02:14.148 EDT [242925] DEBUG: postmaster received signal 2
2021-11-01 12:02:14.148 EDT [242925] LOG: received fast shutdown request
2021-11-01 12:02:14.150 EDT [242925] LOG: aborting any active transactions
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(0): 5 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: proc_exit(0): 2 callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: exit(0)
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242933] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(1): 7 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: removing all KnownAssignedXids
2021-11-01 12:02:14.150 EDT [242928] DEBUG: release all standby locks
2021-11-01 12:02:14.150 EDT [242928] DEBUG: proc_exit(1): 2 callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: exit(1)
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:14.150 EDT [242928] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:14.151 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:14.151 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:14.449 EDT [242925] DEBUG: forked new backend, pid=242954 socket=12
2021-11-01 12:02:14.450 EDT [242954] LOG: connection received: host=::1 port=36798
2021-11-01 12:02:15.915 EDT [242936] FATAL: terminating walreceiver process due to administrator command
2021-11-01 12:02:15.915 EDT [242936] DEBUG: shmem_exit(1): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:15.915 EDT [242936] DEBUG: shmem_exit(1): 6 on_shmem_exit callbacks to make
2021-11-01 12:02:15.915 EDT [242936] DEBUG: proc_exit(1): 2 callbacks to make
2021-11-01 12:02:15.915 EDT [242936] DEBUG: exit(1)
2021-11-01 12:02:15.916 EDT [242936] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:15.916 EDT [242936] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:15.916 EDT [242936] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:15.917 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:15.917 EDT [242932] LOG: shutting down
2021-11-01 12:02:15.917 EDT [242932] DEBUG: skipping restartpoint, already performed at 1/28
2021-11-01 12:02:15.919 EDT [242932] DEBUG: updated min recovery point to 1/1000000 on timeline 118
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(0): 1 before_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(0): 5 on_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: proc_exit(0): 2 callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: exit(0)
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:15.920 EDT [242932] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:15.921 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:15.921 EDT [242934] DEBUG: writing stats file "pg_stat/global.stat"
2021-11-01 12:02:15.921 EDT [242934] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:15.921 EDT [242934] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:15.921 EDT [242934] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:15.922 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:17.890 EDT [242925] DEBUG: postmaster received signal 2
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: proc_exit(0): 1 callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: exit(0)
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.617 EDT [242954] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:18.618 EDT [242925] DEBUG: reaping dead processes
2021-11-01 12:02:18.618 EDT [242925] DEBUG: server process (PID 242954) exited with exit code 0
2021-11-01 12:02:18.618 EDT [242925] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.618 EDT [242925] DEBUG: shmem_exit(0): 5 on_shmem_exit callbacks to make
2021-11-01 12:02:18.618 EDT [242925] DEBUG: cleaning up dynamic shared memory control segment with ID 2104592565
2021-11-01 12:02:18.631 EDT [242925] DEBUG: proc_exit(0): 2 callbacks to make
2021-11-01 12:02:18.631 EDT [242925] LOG: database system is shut down
2021-11-01 12:02:18.631 EDT [242925] DEBUG: exit(0)
2021-11-01 12:02:18.631 EDT [242925] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.631 EDT [242925] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.631 EDT [242925] DEBUG: proc_exit(-1): 0 callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: logger shutting down
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: proc_exit(0): 0 callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: exit(0)
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2021-11-01 12:02:18.633 EDT [242927] DEBUG: proc_exit(-1): 0 callbacks to make

If I stop Patroni, the server runs fine and joins replication as expected. The postgres user can connect without any problem, locally or remotely, using the psql utility or even pgAdmin.
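For reference, the manual checks described above look roughly like this (the 192.168.23.13 address comes from the log's listen addresses; everything else is an example, and the guard makes the snippet a no-op on hosts without psql):

```shell
# Verify the standby accepts connections for the postgres user,
# both over the loopback and the cluster interface.
export PGCONNECT_TIMEOUT=3
if command -v psql >/dev/null 2>&1; then
    psql -h localhost -p 5432 -U postgres \
        -c 'SELECT pg_is_in_recovery();' || true
    psql -h 192.168.23.13 -p 5432 -U postgres \
        -c 'SELECT version();' || true
else
    echo "psql not installed on this host; skipping checks"
fi
```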

Could someone explain what's going on here? I'm still learning PostgreSQL and I'm not quite comfortable reading these logs yet. Thank you!

For more context you can also look at Patroni issue #2104 I posted.

1 Answer


OK, I finally found the cause! Running strace on Patroni, I saw the following looping:

388084 sendmsg(5, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\35", iov_len=4}, {iov_base="\2\0\0\n26:94667\0\226\321\200\2732\253J\266\263\301\2\243\364\360\261\375", iov_len=29}], msg_iovlen=2, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 33
388084 read(5, "\0\0\0\267", 4) = 4
388084 read(5, "\0\0\0\0", 4) = 4
388084 read(5, "\0\0\0\0\0\0\0\1\0\0\0\2\0\0\0\vDOMAIN.COM\0\0\0\10P"..., 183) = 183
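In case it helps anyone else, here is roughly how that trace was captured. The process name "patroni" is an assumption about how the daemon is launched on your host, and the guard keeps the snippet harmless where strace or Patroni is absent:

```shell
# Attach strace to the running Patroni daemon for ~10 seconds and
# record network-related syscalls plus read/write on its sockets.
if command -v strace >/dev/null 2>&1 && pid=$(pgrep -of patroni); then
    timeout 10 strace -f -tt -e trace=network,read,write \
        -p "$pid" -o /tmp/patroni.trace || true
    head -n 20 /tmp/patroni.trace 2>/dev/null || true
else
    echo "strace or patroni not available here; skipping"
fi
```

A repeating sendmsg/read cycle like the one above, with your Kerberos realm visible in the payload, is what pointed to authentication rather than PostgreSQL itself.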

It turns out an incorrect Kerberos principal for the postgres user was still cached; after kdestroy -A we were back in business.
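In shell terms, the fix amounts to the following, run as the postgres OS user. klist and kdestroy are the standard MIT Kerberos client tools; the guard makes this a no-op on hosts without them:

```shell
# Inspect the cached Kerberos credentials, then destroy every cache so
# the next connection obtains a fresh ticket for the correct principal.
if command -v kdestroy >/dev/null 2>&1; then
    klist 2>/dev/null || true        # show which principals are cached
    kdestroy -A 2>/dev/null || true  # remove all credential caches
    echo "Kerberos credential caches cleared"
else
    echo "Kerberos client tools not installed; nothing to clear"
fi
```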
