I have three nodes that I want to set up as a Percona XtraDB Cluster (PXC). I have bootstrapped the first node and joined the second, but for some reason the third node cannot join. The configuration is identical on all nodes, as I simply copied and pasted it:
[mysqld]
# Galera
wsrep_cluster_address = gcomm://10.1.5.100,10.1.5.101,10.1.5.102
wsrep_cluster_name = db-test
wsrep_provider = /usr/lib/libgalera_smm.so
wsrep_provider=/usr/lib64/galera3/libgalera_smm.so
wsrep_provider_options = "gcache.size=256M"
wsrep_slave_threads = 16 # 2~3 times with CPU
wsrep_sst_auth = "sstuser:sstPwd#123"
wsrep_sst_method = xtrabackup-v2

I am running the nodes on CentOS 7.x. Below is the status of the two PXC nodes already up and running:
| wsrep_ist_receive_seqno_end      | 0                                       |
| wsrep_incoming_addresses         | 10.1.5.100:3306,10.1.5.101:3306         |
| wsrep_cluster_weight             | 2                                       |
| wsrep_desync_count               | 0                                       |
| wsrep_evs_delayed                |                                         |
| wsrep_evs_evict_list             |                                         |
| wsrep_evs_repl_latency           | 0/0/0/0/0                               |
| wsrep_evs_state                  | OPERATIONAL                             |
| wsrep_gcomm_uuid                 | 8d59ca0f-cd35-11e8-863c-d79869fa6d80    |
| wsrep_cluster_conf_id            | 4                                       |
| wsrep_cluster_size               | 2                                       |
| wsrep_cluster_state_uuid         | ac97f711-cad5-11e8-8f39-be9d0594cdb9    |
| wsrep_cluster_status             | Primary                                 |
| wsrep_connected                  | ON                                      |
| wsrep_local_bf_aborts            | 0                                       |
| wsrep_local_index                | 0                                       |
| wsrep_provider_name              | Galera                                  |
| wsrep_provider_vendor            | Codership Oy <info@codership.com>       |
| wsrep_provider_version           | 3.31(rf216443)                          |
| wsrep_ready                      | ON                                      |
+----------------------------------+-----------------------------------------+
71 rows in set (0.01 sec)

Below is the error from the error log of the third node failing to join:
backup-v2|10.1.5.102:4444/xtrabackup_sst//1
2018-10-11T09:20:03.278884-00:00 2 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 2) (Increment: 1 -> 3)
2018-10-11T09:20:03.278997-00:00 2 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:03.279155-00:00 2 [Note] WSREP: Assign initial position for certification: 69, protocol version: 4
2018-10-11T09:20:03.279626-00:00 0 [Note] WSREP: Service thread queue flushed.
2018-10-11T09:20:03.280052-00:00 2 [Note] WSREP: Check if state gap can be serviced using IST
2018-10-11T09:20:03.280145-00:00 2 [Note] WSREP: Local state seqno is undefined (-1)
2018-10-11T09:20:03.280445-00:00 2 [Note] WSREP: State gap can't be serviced using IST. Switching to SST
2018-10-11T09:20:03.280510-00:00 2 [Note] WSREP: Failed to prepare for incremental state transfer: Local state seqno is undefined: 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():549. IST will be unavailable.
2018-10-11T09:20:03.287673-00:00 0 [Note] WSREP: Member 1.0 (db-test-3.pd.local) requested state transfer from '*any*'. Selected 0.0 (db-test-2.pd.local)(SYNCED) as donor.
2018-10-11T09:20:03.287850-00:00 0 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 69)
2018-10-11T09:20:03.288073-00:00 2 [Note] WSREP: Requesting state transfer: success, donor: 0
2018-10-11T09:20:03.288225-00:00 2 [Note] WSREP: GCache history reset: ac97f711-cad5-11e8-8f39-be9d0594cdb9:0 -> ac97f711-cad5-11e8-8f39-be9d0594cdb9:69
2018-10-11T09:20:38.988120-00:00 0 [Warning] WSREP: 0.0 (db-test-2.pd.local): State transfer to 1.0 (db-test-3.pd.local) failed: -32 (Broken pipe)
2018-10-11T09:20:38.988274-00:00 0 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():766: Will never receive state. Need to abort.
2018-10-11T09:20:38.988366-00:00 0 [Note] WSREP: gcomm: terminating thread
2018-10-11T09:20:38.988493-00:00 0 [Note] WSREP: gcomm: joining thread
2018-10-11T09:20:38.988942-00:00 0 [Note] WSREP: gcomm: closing backend
2018-10-11T09:20:38.995070-00:00 0 [Note] WSREP: Current view of cluster as seen by this node view (view_id(NON_PRIM,8d59ca0f,3) memb { d3167260,0 } joined { } left { } partitioned { 8d59ca0f,0 e3def063,0 } )
2018-10-11T09:20:38.995334-00:00 0 [Note] WSREP: Current view of cluster as seen by this node view ((empty))
2018-10-11T09:20:38.996612-00:00 0 [Note] WSREP: gcomm: closed
2018-10-11T09:20:38.996837-00:00 0 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Terminated
2018-10-11T09:20:47.767946+00:00 WSREP_SST: [ERROR] Removing /var/lib/mysql//xtrabackup_galera_info file due to signal
2018-10-11T09:20:47.788109+00:00 WSREP_SST: [ERROR] Removing file due to signal
2018-10-11T09:20:47.808425+00:00 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2018-10-11T09:20:47.818240+00:00 WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 143 143
2018-10-11T09:20:47.828411+00:00 WSREP_SST: [ERROR] ******************************************************
2018-10-11T09:20:47.840006+00:00 WSREP_SST: [ERROR] Cleanup after exit with status:32

And below is the error from the node that was chosen as the donor:
2018/10/11 09:20:38 socat[22418] E connect(5, AF=2 10.1.5.102:4444, 16): No route to host
2018-10-11T09:20:38.805798+00:00 WSREP_SST: [ERROR] ******************* FATAL ERROR **********************
2018-10-11T09:20:38.818683+00:00 WSREP_SST: [ERROR] Error while sending data to joiner node: exit codes: 0 1
2018-10-11T09:20:38.832059+00:00 WSREP_SST: [ERROR] ******************************************************
2018-10-11T09:20:38.846813+00:00 WSREP_SST: [ERROR] Cleanup after exit with status:32
2018-10-11T09:20:38.985060-00:00 0 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.5.102:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.23-23-57' --binlog 'db-test-2-bin' --gtid 'ac97f711-cad5-11e8-8f39-be9d0594cdb9:69' : 32 (Broken pipe)
2018-10-11T09:20:38.985552-00:00 0 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.1.5.102:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --mysqld-version '5.7.23-23-57' --binlog 'db-test-2-bin' --gtid 'ac97f711-cad5-11e8-8f39-be9d0594cdb9:69'
2018-10-11T09:20:38.990613-00:00 0 [Warning] WSREP: 0.0 (db-test-2.pd.local): State transfer to 1.0 (db-test-3.pd.local) failed: -32 (Broken pipe)
2018-10-11T09:20:38.990815-00:00 0 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 69)
2018-10-11T09:20:38.997784-00:00 0 [Note] WSREP: declaring e3def063 at tcp://10.1.5.100:4567 stable
2018-10-11T09:20:38.997807-00:00 0 [Note] WSREP: Member 0.0 (db-test-2.pd.local) synced with group.
2018-10-11T09:20:38.998230-00:00 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 69)
2018-10-11T09:20:38.998277-00:00 0 [Note] WSREP: forgetting d3167260 (tcp://10.1.5.102:4567)
2018-10-11T09:20:38.998806-00:00 13 [Note] WSREP: Synchronized with group, ready for connections
2018-10-11T09:20:38.999112-00:00 13 [Note] WSREP: Setting wsrep_ready to true
2018-10-11T09:20:38.999198-00:00 13 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:39.003491-00:00 0 [Note] WSREP: Node 8d59ca0f state primary
2018-10-11T09:20:39.005025-00:00 0 [Note] WSREP: Current view of cluster as seen by this node view (view_id(PRIM,8d59ca0f,4) memb { 8d59ca0f,0 e3def063,0 } joined { } left { } partitioned { d3167260,0 } )
2018-10-11T09:20:39.005270-00:00 0 [Note] WSREP: Save the discovered primary-component to disk
2018-10-11T09:20:39.009691-00:00 0 [Note] WSREP: forgetting d3167260 (tcp://10.1.5.102:4567)
2018-10-11T09:20:39.010097-00:00 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2018-10-11T09:20:39.011037-00:00 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9
2018-10-11T09:20:39.019171-00:00 0 [Note] WSREP: STATE EXCHANGE: sent state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9
2018-10-11T09:20:39.021665-00:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9 from 0 (db-test-2.pd.local)
2018-10-11T09:20:39.021786-00:00 0 [Note] WSREP: STATE EXCHANGE: got state msg: eb0b1f21-cd36-11e8-8ac8-c60fb82759c9 from 1 (db-test-1.pd.local)
2018-10-11T09:20:39.021861-00:00 0 [Note] WSREP: Quorum results: version = 4, component = PRIMARY, conf_id = 3, members = 2/2 (primary/total), act_id = 69, last_appl. = 0, protocols = 0/9/3 (gcs/repl/appl), group UUID = ac97f711-cad5-11e8-8f39-be9d0594cdb9
2018-10-11T09:20:39.021999-00:00 0 [Note] WSREP: Flow-control interval: [141, 141]
2018-10-11T09:20:39.022058-00:00 0 [Note] WSREP: Trying to continue unpaused monitor
2018-10-11T09:20:39.022774-00:00 17 [Note] WSREP: REPL Protocols: 9 (4, 2)
2018-10-11T09:20:39.023163-00:00 17 [Note] WSREP: New cluster view: global state: ac97f711-cad5-11e8-8f39-be9d0594cdb9:69, view# 4: Primary, number of nodes: 2, my index: 0, protocol version 3
2018-10-11T09:20:39.023209-00:00 17 [Note] WSREP: Setting wsrep_ready to true
2018-10-11T09:20:39.023256-00:00 17 [Note] WSREP: Auto Increment Offset/Increment re-align with cluster membership change (Offset: 1 -> 1) (Increment: 3 -> 2)
2018-10-11T09:20:39.023373-00:00 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2018-10-11T09:20:39.023540-00:00 17 [Note] WSREP: Assign initial position for certification: 69, protocol version: 4
2018-10-11T09:20:39.023832-00:00 0 [Note] WSREP: Service thread queue flushed.
2018-10-11T09:20:44.480289-00:00 0 [Note] WSREP: cleaning up d3167260 (tcp://10.1.5.102:4567)

When I bootstrap the third node as its own cluster, it runs just fine. But when I stop the first two nodes and try to have them join this new cluster, they fail to join. I can ping and telnet to the first two cluster nodes from the third node and vice versa. I even tried stopping all nodes and bootstrapping the cluster from scratch, but that did not help.
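For reference, the reachability check mentioned above can be scripted per port rather than done by hand. This is only a minimal sketch and not part of my actual setup: the names `probe` and `probe_node` are hypothetical, and the port list assumes the PXC defaults (3306 for clients, 4444 for SST, 4567 for group communication, 4568 for IST), of which 3306, 4444 and 4567 appear in the output above. It returns the OS error text for each failed connection, so that a refused connection can be told apart from "No route to host".

```python
import socket

# Assumed PXC default ports: 3306 (clients), 4444 (SST),
# 4567 (group communication), 4568 (IST).
PXC_PORTS = (3306, 4444, 4567, 4568)


def probe(host, port, timeout=3.0):
    """Try a TCP connection; return None on success, else the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        # Covers refused connections, unreachable hosts and timeouts.
        return str(exc)


def probe_node(host, ports=PXC_PORTS, timeout=3.0):
    """Return a dict mapping each port to None (reachable) or an error text."""
    return {port: probe(host, port, timeout) for port in ports}


# Example (addresses taken from wsrep_cluster_address above):
# for node in ("10.1.5.100", "10.1.5.101", "10.1.5.102"):
#     print(node, probe_node(node))
```

As far as I understand, port 4444 is only listened on by the joiner while an SST is in progress, so a refused connection there outside of a transfer would be expected. The check would have to be run from each node against the other two.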
What is really going on here?