
I recently broke replication, and when I tried to skip past the one incorrect transaction I got the following.

MariaDB [(none)]> STOP SLAVE;
Query OK, 0 rows affected (0.05 sec)

MariaDB [(none)]> SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.

MariaDB [(none)]> select @@gtid_slave_pos;
+---------------------------------------------+
| @@gtid_slave_pos                            |
+---------------------------------------------+
| 0-1051-1391406,1-1050-1182069,57-1051-98897 |
+---------------------------------------------+
1 row in set (0.00 sec)

MariaDB [(none)]> show variables like '%_pos%';
+----------------------+---------------------------------------------------------+
| Variable_name        | Value                                                   |
+----------------------+---------------------------------------------------------+
| gtid_binlog_pos      | 0-1051-1391406,2-1051-4474,57-1051-98897                |
| gtid_current_pos     | 0-1051-1391406,1-1050-1182069,2-1051-4474,57-1051-98897 |
| gtid_slave_pos       | 0-1051-1391406,1-1050-1182069,57-1051-98897             |
| wsrep_start_position | 00000000-0000-0000-0000-000000000000:-1                 |
+----------------------+---------------------------------------------------------+

What do I need to do to fix this?

Update 1

MariaDB [(none)]> show variables like '%gtid%';
+------------------------+------------------------------------------+
| Variable_name          | Value                                    |
+------------------------+------------------------------------------+
| gtid_binlog_pos        | 1-1050-4820789,2-1051-379101,3-1010-3273 |
| gtid_binlog_state      | 1-1050-4820789,2-1051-379101,3-1010-3273 |
| gtid_current_pos       | 1-1050-4819948,2-1051-379101,3-1010-3273 |
| gtid_domain_id         | 3                                        |
| gtid_ignore_duplicates | OFF                                      |
| gtid_seq_no            | 0                                        |
| gtid_slave_pos         | 1-1050-4819948,2-1051-379101,3-1010-3273 |
| gtid_strict_mode       | OFF                                      |
| last_gtid              |                                          |
| wsrep_gtid_domain_id   | 0                                        |
| wsrep_gtid_mode        | OFF                                      |
+------------------------+------------------------------------------+

Following the instructions in the error message, I tried setting @@gtid_slave_pos. This was the slave status beforehand:

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: [redacted]
Master_User: [redacted]
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: binary.000591
Read_Master_Log_Pos: 526511543
Relay_Log_File: tmsdb-relay-bin.001239
Relay_Log_Pos: 4
Relay_Master_Log_File: binary.000591
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1062
Last_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
Skip_Counter: 0
Exec_Master_Log_Pos: 60724897
Relay_Log_Space: 465787660
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1062
Last_SQL_Error: Could not execute Write_rows_v1 event on table [redacted] Duplicate entry '1134890' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log binary.000591, end_log_pos 60726493
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1050
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1-1050-4827753,2-1051-379101,3-1010-3273
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
1 row in set (0.00 sec)

Using the gtid_slave_pos variable:

MariaDB [(none)]> select @@gtid_slave_pos\G;
*************************** 1. row ***************************
@@gtid_slave_pos: 1-1050-4819948,2-1051-379101,3-1010-3273

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.21 sec)

MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3274';
Query OK, 0 rows affected (0.10 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.21 sec)

When I check the status after running the above, I get: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'

MariaDB [(none)]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 10.56.228.64
Master_User: maxscale
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: binary.000591
Read_Master_Log_Pos: 60724897
Relay_Log_File: tmsdb-relay-bin.001239
Relay_Log_Pos: 4
Relay_Master_Log_File: binary.000591
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 60724897
Relay_Log_Space: 249
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 3-1010-3274, which is not in the master's binlog'
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1050
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Current_Pos
Gtid_IO_Pos: 1-1050-4819948,2-1051-379101,3-1010-3274
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
1 row in set (0.00 sec)

I can get this back to the previous state with:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> SET GLOBAL gtid_slave_pos='1-1050-4819948,2-1051-379101,3-1010-3273';
Query OK, 0 rows affected (0.09 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.06 sec)

2 Answers


I found that the following worked for me. Note that it does not restore the slave to an exact replica of the master; there will be data differences, which I will fix with pt-table-sync.

1. Restart Replication without GTID method
2. Stop Parallel slave threads
3. Enable GTID replication
4. Using percona-toolkit pt-slave-restart to skip past all the errors.

1. Restart Replication without GTID method

Using the master binlog position:

CHANGE MASTER TO
  MASTER_HOST='12.34.56.789',
  MASTER_USER='slave_user',
  MASTER_PASSWORD='password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=107;

This is well documented; search for detailed instructions if you need them.

2. Stop Parallel slave threads

This was part of the problem as seen in the original question.

ERROR 1966 (HY000): When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter cannot be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position.

I want to be able to skip events without having to work out or increment the GTID position for each one.

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.35 sec)

MariaDB [(none)]> set global slave_parallel_threads = 0;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_parallel_mode = none;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> Start SLAVE;
Query OK, 0 rows affected (0.00 sec)

Now if I check the parallel mode, I see:

MariaDB [(none)]> show slave status \G
*************************** 1. row ***************************
..........
Parallel_Mode: none

I can reverse this process to re-enable parallel slave threads once I am done and know that GTID is working.

3. Enable GTID replication

I can now try restarting the slave with GTID enabled.

On the master

MariaDB [(none)]> SHOW MASTER STATUS\G
*************************** 1. row ***************************
File: mariadb-bin.000001
Position: 510
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

SELECT BINLOG_GTID_POS('mariadb-bin.000001', 510);
+--------------------------------------------+
| BINLOG_GTID_POS('mariadb-bin.000001', 510) |
+--------------------------------------------+
| 1-101-1                                    |
+--------------------------------------------+
1 row in set (0.00 sec)

On the slave

STOP SLAVE;
SET GLOBAL gtid_slave_pos = '1-101-1';
CHANGE MASTER TO master_use_gtid=slave_pos;
START SLAVE;

Now when I check the slave it has some events to skip to get back into the same state as the master.

Last_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.

MariaDB [(none)]> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Log_File: binary.000599
Read_Master_Log_Pos: 364810491
Relay_Log_File: tmsdb-relay-bin.001240
Relay_Log_Pos: 716
Relay_Master_Log_File: binary.000599
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1950
Last_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.
Skip_Counter: 0
Exec_Master_Log_Pos: 286447058
Relay_Log_Space: 78364447
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1950
Last_SQL_Error: An attempt was made to binlog GTID 1-1050-5004291 which would create an out-of-order sequence number with existing GTID 1-1050-5004322, and gtid strict mode is enabled.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1050
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 1-1050-5005223,2-1051-379101,3-1010-3273
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: none
1 row in set (0.00 sec)
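To make error 1950 a little more concrete, here is a minimal Python sketch of the comparison that strict mode appears to be making: within one replication domain, a new sequence number must be higher than the one already recorded. The names `Gtid`, `parse_gtid` and `is_out_of_order` are my own for illustration and are not part of MariaDB or any client library.

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    """A MariaDB GTID, written as domain-server-sequence."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a single GTID such as '1-1050-5004291'."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return Gtid(domain, server, seq)


def is_out_of_order(incoming: str, existing: str) -> bool:
    """True when both GTIDs share a domain and the incoming seq_no is not higher."""
    new, old = parse_gtid(incoming), parse_gtid(existing)
    return new.domain_id == old.domain_id and new.seq_no <= old.seq_no


# The two GTIDs quoted in the Last_Error message above.
print(is_out_of_order("1-1050-5004291", "1-1050-5004322"))  # True
```

GTIDs from different domains are never compared with each other here, which matches the way the position is tracked per domain in gtid_slave_pos.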

4. Using percona-toolkit pt-slave-restart to skip past all the errors

sudo yum install http://www.percona.com/downloads/percona-release/redhat/0.1-4/percona-release-0.1-4.noarch.rpm
sudo yum search percona-toolkit

pt-slave-restart will skip all the events needed to get the slave into a working state.

# pt-slave-restart
2017-12-22T13:39:59 tmsdb-relay-bin.001240 716 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 69702 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 97912 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 98144 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 363903 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 364135 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 712776 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 713008 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 759737 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 827932 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 828164 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 934851 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 952088 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 952320 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1084249 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1084481 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1351188 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1351420 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1621561 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1693920 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1711677 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1711909 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1880931 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1881163 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 1916544 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 2124672 1950
2017-12-22T13:40:01 tmsdb-relay-bin.001240 2124904 1950
2017-12-22T13:40:01 tmsdb-relay-bin.001240 2125136 1950
2017-12-22T13:40:01 tmsdb-relay-bin.001240 2452030 1950
2017-12-22T13:40:01 tmsdb-relay-bin.001240 2452262 1950
2017-12-22T13:40:01 tmsdb-relay-bin.001240 2819749 1950
2017-12-22T13:40:01 tmsdb-relay-bin.001240 2819981 1950
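Since every line in that output is a skipped event, it can be worth tallying them before running pt-table-sync. The following is a rough Python sketch, assuming the four whitespace-separated columns seen above (timestamp, relay log file, relay log position, error code); `summarize_skips` is a hypothetical helper of my own, not something shipped with percona-toolkit.

```python
from collections import Counter


def summarize_skips(output: str) -> Counter:
    """Count skipped events per (relay log file, error code)."""
    counts: Counter = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Expected columns: timestamp, relay log file, relay log position, errno.
        # Anything else (the command line itself, Perl error text) is ignored.
        if len(fields) != 4 or not fields[3].isdigit():
            continue
        _timestamp, relay_log, _position, errno = fields
        counts[(relay_log, int(errno))] += 1
    return counts


# First three lines of the output above, used as sample input.
sample = """\
2017-12-22T13:39:59 tmsdb-relay-bin.001240 716 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 69702 1950
2017-12-22T13:40:00 tmsdb-relay-bin.001240 97912 1950
"""

for (relay_log, errno), count in summarize_skips(sample).items():
    print(f"{relay_log} error {errno}: {count} skipped")
```

The count only says how many events were skipped, not which rows now differ, so it is a sanity check rather than a replacement for comparing the tables.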

Now when I check my slave status:

MariaDB [(none)]> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: masterhost
Master_User: maxscale
Master_Port: 3306
Connect_Retry: 5
Master_Log_File: binary.000600
Read_Master_Log_Pos: 37801368
Relay_Log_File: tmsdb-relay-bin.001242
Relay_Log_Pos: 37801653
Relay_Master_Log_File: binary.000600
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 37801368
Relay_Log_Space: 37801991
Until_Condition: None
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Master_Server_Id: 1050
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 1-1050-5014401,2-1051-379101,3-1010-3273
Parallel_Mode: none
1 row in set (0.00 sec)

Lastly I need to restart the server and make sure it is reboot safe, etc.

  • This one is a keeper. +1 !!! Commented Jan 4, 2018 at 13:36
  • See my second answer below, as it is more succinct and has an actual fix. dba.stackexchange.com/a/197477/15291 Commented Apr 20, 2018 at 5:56

I have found in production that Parallel_Mode is the most likely cause of my problems.

I recommend using a value other than optimistic.

MariaDB [(none)]> select @@slave_parallel_mode\G
*************************** 1. row ***************************
@@slave_parallel_mode: optimistic

With that setting you may get the following error from pt-slave-restart:

pt-slave-restart
2018-02-09T10:39:19 tmsdb-relay-bin.000388 4 1032
DBD::mysql::st execute failed: When using parallel replication and GTID with multiple replication domains, @@sql_slave_skip_counter can not be used. Instead, setting @@gtid_slave_pos explicitly can be used to skip to after a given GTID position. [for Statement "SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1"] at /bin/pt-slave-restart line 5122.

In the logs I see the following:

tail /var/log/mariadb.log
2018-02-09 10:35:46 139919003784960 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table [tablename]; Can't find record in '[tablename]', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log binary.000953, end_log_pos 264325215, Gtid 1-1050-13462991, Internal MariaDB error code: 1032
2018-02-09 10:35:46 139919003784960 [Warning] Slave: Can't find record in '[tablename]' Error_code: 1032
2018-02-09 10:35:46 139919003784960 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binary.000953' position 262879171; GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'
2018-02-09 10:35:46 139918776985344 [Note] Slave SQL thread exiting, replication stopped in log 'binary.000953' at position 262879171; GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'
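For completeness, these log lines seem to contain everything needed for the route the error message itself suggests, setting @@gtid_slave_pos to just after the failed event. Below is a minimal Python sketch of that calculation; `skip_failed_gtid` is a hypothetical helper of my own, and it assumes that only the entry for the failing event's domain should change while the other domains stay as they were. In my original question I bumped the domain 3 entry instead, which looks like the reason the master rejected that position.

```python
import re


def skip_failed_gtid(stopped_pos: str, failed_gtid: str) -> str:
    """Return stopped_pos with the failed GTID substituted for its domain's entry."""
    failed_domain = failed_gtid.split("-")[0]
    entries = stopped_pos.split(",")
    domains = [entry.split("-")[0] for entry in entries]
    if failed_domain not in domains:
        raise ValueError(f"domain {failed_domain} is not in {stopped_pos!r}")
    return ",".join(
        failed_gtid if domain == failed_domain else entry
        for domain, entry in zip(domains, entries)
    )


# Excerpts copied from the mariadb.log lines above.
error_line = (
    "the event's master log binary.000953, end_log_pos 264325215, "
    "Gtid 1-1050-13462991, Internal MariaDB error code: 1032"
)
stop_line = (
    "We stopped at log 'binary.000953' position 262879171; "
    "GTID position '1-1050-13462990,2-1051-379101,3-1010-3273'"
)

failed = re.search(r"Gtid (\d+-\d+-\d+)", error_line).group(1)
stopped = re.search(r"GTID position '([^']+)'", stop_line).group(1)
new_pos = skip_failed_gtid(stopped, failed)

# Statement to run on the slave between STOP SLAVE and START SLAVE.
print(f"SET GLOBAL gtid_slave_pos = '{new_pos}';")
```

This skips one transaction at a time, so each further failure needs the same treatment, which is exactly the bookkeeping I wanted to avoid and why I went with the approach below instead.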

To restart the slave after it fails, you can do the following.
Stop all slave_parallel_threads and disable slave_parallel_mode:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.35 sec)

MariaDB [(none)]> set global slave_parallel_threads = 0;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_parallel_mode = none;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> Start SLAVE;
Query OK, 0 rows affected (0.00 sec)

I now use pt-slave-restart to restart slaves, because I don't have to think about sequence numbers and a whole bundle of other things that take too long when I just want to get the slave started.

pt-slave-restart 

It will now run without errors. Press Ctrl-C to stop it once you are happy that your slave has caught up.

This is not much different from running the following by hand, except that it happens automagically:

STOP SLAVE;
SET GLOBAL sql_slave_skip_counter = 1;
START SLAVE;

If you need parallel threads, you can re-enable them once the slave has caught up or gotten past the event causing problems. I would try a different slave_parallel_mode, such as conservative:

MariaDB [(none)]> stop slave;
Query OK, 0 rows affected (0.01 sec)

MariaDB [(none)]> set global slave_parallel_threads = 4;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> set global slave_parallel_mode = conservative;
Query OK, 0 rows affected (0.00 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.09 sec)
