
I have three nodes that I want to set up as a Percona XtraDB Cluster (PXC). I have bootstrapped the first node and joined the second node, but somehow cannot join the third node. The configuration is identical on all three nodes, since I simply copied and pasted it:

[mysqld]
# Galera
wsrep_cluster_address = gcomm://10.1.5.100,10.1.5.101,10.1.5.102
wsrep_cluster_name = db-test
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_provider_options = "gcache.size=256M"
wsrep_slave_threads = 16 # 2~3 times with CPU
wsrep_sst_auth = "sstuser:sstPwd#123"
wsrep_sst_method = xtrabackup-v2

I am running the nodes on CentOS 7.x. Below is the status of the two PXC nodes already up and running:

| wsrep_ist_receive_seqno_end      | 0                                       |
| wsrep_incoming_addresses         | 10.1.5.100:3306,10.1.5.101:3306         |
| wsrep_cluster_weight             | 2                                       |
| wsrep_desync_count               | 0                                       |
| wsrep_evs_delayed                |                                         |
| wsrep_evs_evict_list             |                                         |
| wsrep_evs_repl_latency           | 0/0/0/0/0                               |
| wsrep_evs_state                  | OPERATIONAL                             |
| wsrep_gcomm_uuid                 | 8d59ca0f-cd35-11e8-863c-d79869fa6d80    |
| wsrep_cluster_conf_id            | 4                                       |
| wsrep_cluster_size               | 2                                       |
| wsrep_cluster_state_uuid         | ac97f711-cad5-11e8-8f39-be9d0594cdb9    |
| wsrep_cluster_status             | Primary                                 |
| wsrep_connected                  | ON                                      |
| wsrep_local_bf_aborts            | 0                                       |
| wsrep_local_index                | 0                                       |
| wsrep_provider_name              | Galera                                  |
| wsrep_provider_vendor            | Codership Oy <[email protected]>        |
| wsrep_provider_version           | 3.31(rf216443)                          |
| wsrep_ready                      | ON                                      |
+----------------------------------+-----------------------------------------+
71 rows in set (0.01 sec)

Below is the error from the error log of the third node failing to join:

backup-v2|10.1.5.102:4444/xtrabackup_sst//1
2018-10-11T09:20:03.278884-00:00 2 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 2) (Increment: 1 -> 3)
2018-10-11T09:20:03.278997-00:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:03.279155-00:00 2 [Note] WSREP: Assign initial position for certification: 69, protocol version: 4
2018-10-11T09:20:03.279626-00:00 0 [Note] WSREP: Service thread queue flushed.
2018-10-11T09:20:03.280052-00:00 2 [Note] WSREP: Check if state gap can be serviced using IST
2018-10-11T09:20:03.280145-00:00 2 [Note] WSREP: Local state seqno is undefined (-1)
2018-10-11T09:20:03.280445-00:00 2 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2018-10-11T09:20:03.280510-00:00 2 [Note] WSREP: Failed to prepare for incremental state transfer: Local state seqno is undefined: 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():549. IST will be unavailable.
2018-10-11T09:20:03.287673-00:00 0 [Note] WSREP: Member 1.0 (db-test-3.pd.local) requested state transfer from '*any*'. Selected 0.0 (db-test-2.pd.local)(SYNCED) as donor.
2018-10-11T09:20:03.287850-00:00 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 69)
2018-10-11T09:20:03.288073-00:00 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-10-11T09:20:03.288225-00:00 2 [Note] WSREP: GCache history reset: ac97f711-cad5-11e8-8f39-be9d0594cdb9:0 -> ac97f711-cad5-11e8-8f39-be9d0594cdb9:69
2018-10-11T09:20:38.988120-00:00 0 [Warning] WSREP: 0.0 (db-test-2.pd.local): State transfer to 1.0 (db-test-3.pd.local) failed: -32 (Broken pipe)
2018-10-11T09:20:38.988274-00:00 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():766: Will never receive state. Need to abort.
2018-10-11T09:20:38.988366-00:00 0 [Note] WSREP: gcomm: terminating thread
2018-10-11T09:20:38.988493-00:00 0 [Note] WSREP: gcomm: joining thread
2018-10-11T09:20:38.988942-00:00 0 [Note] WSREP: gcomm: closing backend
2018-10-11T09:20:38.995070-00:00 0 [Note] WSREP: Current view of cluster as seen by this node view (view_id(NON_PRIM,8d59ca0f,3) memb { d3167260,0 } joined { } left { } partitioned { 8d59ca0f,0 e3def063,0 } )
2018-10-11T09:20:38.995334-00:00 0 [Note] WSREP: Current view of cluster as seen by this node view ((empty))
2018-10-11T09:20:38.996612-00:00 0 [Note] WSREP: gcomm: closed
2018-10-11T09:20:38.996837-00:00 0 [Note] WSREP: /usr/sbin/mysqld: Terminated. Terminated
2018-10-11T09:20:47.767946+00:00 WSREP_SST: [ERROR] Removing /var/lib/mysql//xtrabackup_galera_info file due to signal
2018-10-11T09:20:47.788109+00:00 WSREP_SST: [ERROR] Removing file due to signal
2018-10-11T09:20:47.808425+00:00 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2018-10-11T09:20:47.818240+00:00 WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 143 143
2018-10-11T09:20:47.828411+00:00 WSREP_SST: [ERROR] ******************************************************
2018-10-11T09:20:47.840006+00:00 WSREP_SST: [ERROR] Cleanup after exit with status:32

And below is the error from the node that was chosen as the donor:

2018/10/11 09:20:38 socat[22418] E connect(5, AF=2 10.1.5.102:4444, 16): No route to host
2018-10-11T09:20:38.805798+00:00 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2018-10-11T09:20:38.818683+00:00 WSREP_SST: [ERROR] Error while sending data to joiner node: exit codes: 0 1
2018-10-11T09:20:38.832059+00:00 WSREP_SST: [ERROR] ******************************************************
2018-10-11T09:20:38.846813+00:00 WSREP_SST: [ERROR] Cleanup after exit with status:32
2018-10-11T09:20:38.985060-00:00 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.5.102:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.23-23-57' --binlog 'db-test-2-bin' --gtid 'ac97f711-cad5-11e8-8f39-be9d0594cdb9:69' : 32 (Broken pipe)
2018-10-11T09:20:38.985552-00:00 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.5.102:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.23-23-57' --binlog 'db-test-2-bin' --gtid 'ac97f711-cad5-11e8-8f39-be9d0594cdb9:69'
2018-10-11T09:20:38.990613-00:00 0 [Warning] WSREP: 0.0 (db-test-2.pd.local): State transfer to 1.0 (db-test-3.pd.local) failed: -32 (Broken pipe)
2018-10-11T09:20:38.990815-00:00 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 69)
2018-10-11T09:20:38.997784-00:00 0 [Note] WSREP: declaring e3def063 at tcp://10.1.5.100:4567 stable
2018-10-11T09:20:38.997807-00:00 0 [Note] WSREP: Member 0.0 (db-test-2.pd.local) synced with group.
2018-10-11T09:20:38.998230-00:00 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 69)
2018-10-11T09:20:38.998277-00:00 0 [Note] WSREP: forgetting d3167260 (tcp://10.1.5.102:4567)
2018-10-11T09:20:38.998806-00:00 13 [Note] WSREP: Synchronized with group, ready for connections
2018-10-11T09:20:38.999112-00:00 13 [Note] WSREP: Setting wsrep_ready to true
2018-10-11T09:20:38.999198-00:00 13 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:39.003491-00:00 0 [Note] WSREP: Node 8d59ca0f state primary
2018-10-11T09:20:39.005025-00:00 0 [Note] WSREP: Current view of cluster as seen by this node view (view_id(PRIM,8d59ca0f,4) memb { 8d59ca0f,0 e3def063,0 } joined { } left { } partitioned { d3167260,0 } )
2018-10-11T09:20:39.005270-00:00 0 [Note] WSREP: Save the discovered primary-component to disk
2018-10-11T09:20:39.009691-00:00 0 [Note] WSREP: forgetting d3167260 (tcp://10.1.5.102:4567)
2018-10-11T09:20:39.010097-00:00 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2018-10-11T09:20:39.011037-00:00 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9
2018-10-11T09:20:39.019171-00:00 0 [Note] WSREP: STATE EXCHANGE: sent state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9
2018-10-11T09:20:39.021665-00:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9 from 0 (db-test-2.pd.local)
2018-10-11T09:20:39.021786-00:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9 from 1 (db-test-1.pd.local)
2018-10-11T09:20:39.021861-00:00 0 [Note] WSREP: Quorum results: version = 4, component = PRIMARY, conf_id = 3, members = 2/2 (primary/total), act_id = 69, last_appl. = 0, protocols = 0/9/3 (gcs/repl/appl), group UUID = ac97f711-cad5-11e8-8f39-be9d0594cdb9
2018-10-11T09:20:39.021999-00:00 0 [Note] WSREP: Flow-control interval: [141, 141]
2018-10-11T09:20:39.022058-00:00 0 [Note] WSREP: Trying to continue unpaused monitor
2018-10-11T09:20:39.022774-00:00 17 [Note] WSREP: REPL Protocols: 9 (4, 2)
2018-10-11T09:20:39.023163-00:00 17 [Note] WSREP: New cluster view: global state: ac97f711-cad5-11e8-8f39-be9d0594cdb9:69, view# 4: Primary, number of nodes: 2, my index: 0, protocol version 3
2018-10-11T09:20:39.023209-00:00 17 [Note] WSREP: Setting wsrep_ready to true
2018-10-11T09:20:39.023256-00:00 17 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 3 -> 2)
2018-10-11T09:20:39.023373-00:00 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:39.023540-00:00 17 [Note] WSREP: Assign initial position for certification: 69, protocol version: 4
2018-10-11T09:20:39.023832-00:00 0 [Note] WSREP: Service thread queue flushed.
2018-10-11T09:20:44.480289-00:00 0 [Note] WSREP: cleaning up d3167260 (tcp://10.1.5.102:4567)

When I bootstrap the third node as its own cluster, it runs just fine. But when I stop the first two nodes in the other cluster and try to have them join the new cluster, they also fail to join. I can ping and telnet to the first two cluster nodes from the third node and vice versa. I even tried stopping all nodes and bootstrapping the cluster from scratch, but that did not help.

What is really going on here?

1 Answer

First of all, thanks for providing enough debug information, not everybody does that.

Your SST (the initial data copy) is failing. socat on the donor exits with a "No route to host" error, which tells you that the joiner is unreachable from the donor you pasted. This is not really a cluster configuration issue but an OS/network one: the port may be closed, a firewall may be up, or there may be some other network problem. Try pinging the joiner from the donor, or run a test netcat against port 4444, to find where the connection breaks. Once the host is reachable, your SST should succeed and the node should join the cluster. Usually it is some silly mistake: the firewall blocking one of the required ports, wrong datadir permissions, a wrong user, etc.
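As a sketch of that check (using the node IPs from your config; adjust as needed), you can probe every port PXC relies on from each node. "Connection refused" means the host is reachable but nothing is listening there yet; "no route to host" usually points at routing or a firewall.

```shell
#!/bin/sh
# Probe the ports a PXC node needs on each peer; run this from both
# the donor and the joiner. Port roles:
#   3306 = MySQL client, 4567 = Galera group communication,
#   4568 = IST, 4444 = SST (socat/xtrabackup)
for host in 10.1.5.100 10.1.5.101 10.1.5.102; do
  for port in 3306 4567 4568 4444; do
    if nc -z -w 2 "$host" "$port" 2>/dev/null; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"   # firewall, routing, or service down
    fi
  done
done
```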

If this is a test setup, you can also try changing the SST method to a different one to help with debugging (it only uses the MySQL port, so it is simpler).
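For example, a minimal change on the joiner's my.cnf, assuming the same sstuser credentials from your config (mysqldump SST is slow and only suitable for testing, but it needs nothing beyond port 3306, so it takes socat and port 4444 out of the picture):

```ini
[mysqld]
# Debug-only: mysqldump SST uses just the MySQL port (3306)
wsrep_sst_method = mysqldump
wsrep_sst_auth   = "sstuser:sstPwd#123"
```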

  • When I run "nc 10.1.4.102 4444" from the first two servers already connected to the PXC, I get an "Ncat: No route to host" error, but when I run the "nc" command from the failing node towards the nodes in the PXC, it says "connection refused", which indicates that those two nodes can be reached. So it was a firewall problem, as you suggested: disabling the firewall worked. Thanks @jynus Commented Oct 12, 2018 at 9:04
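Rather than disabling the firewall entirely, opening just the PXC ports is usually enough. A sketch for CentOS 7 firewalld, assuming the default zone (run on every node, as root):

```shell
# Permanently allow the ports PXC uses, then reload firewalld
firewall-cmd --permanent --add-port=3306/tcp   # MySQL client traffic
firewall-cmd --permanent --add-port=4567/tcp   # Galera group communication
firewall-cmd --permanent --add-port=4568/tcp   # IST
firewall-cmd --permanent --add-port=4444/tcp   # SST (socat/xtrabackup)
firewall-cmd --reload
```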
