
MySQL instance stalling

Problem

An instance of MySQL 5.6.20 running (mostly just) a database with InnoDB tables exhibits occasional stalls of 1-4 minutes during which all INSERT, UPDATE and DELETE queries remain in the "Query end" state. This obviously is most unfortunate. The MySQL slow query log logs even the most trivial queries with insane query times - hundreds of them sharing the same timestamp, which corresponds to the point in time when the stall was resolved:

# Query_time: 101.743589  Lock_time: 0.000437  Rows_sent: 0  Rows_examined: 0
SET timestamp=1409573952;
INSERT INTO sessions (redirect_login2, data, hostname, fk_users_primary, fk_users, id_sessions, timestamp)
VALUES (NULL, NULL, '192.168.10.151', NULL, 'anonymous', '64ef367018099de4d4183ffa3bc0848a', '1409573850');

And the device statistics show increased, although not excessive, I/O load in this time frame (in this case the updates were stalling from 14:17:30 to 14:19:12, according to the timestamps in the statement above):

# sar -d
[...]
02:15:01 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
02:16:01 PM    dev8-0     41.53    207.43   1227.51     34.55      0.34      8.28      3.89     16.15
02:17:01 PM    dev8-0     59.41    137.71   2240.32     40.02      0.39      6.53      4.04     24.00
02:18:01 PM    dev8-0    122.08   2816.99   1633.44     36.45      3.84     31.46      1.21      2.88
02:19:01 PM    dev8-0    253.29   5559.84   3888.03     37.30      6.61     26.08      1.85      6.73
02:20:01 PM    dev8-0    101.74   1391.92   2786.41     41.07      1.69     16.57      3.55     36.17
[...]

# sar
[...]
02:15:01 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
02:16:01 PM     all     15.99      0.00     12.49      2.08      0.00     69.44
02:17:01 PM     all     13.67      0.00      9.45      3.15      0.00     73.73
02:18:01 PM     all     10.64      0.00      6.26     11.65      0.00     71.45
02:19:01 PM     all      3.83      0.00      2.42     24.84      0.00     68.91
02:20:01 PM     all     20.95      0.00     15.14      6.83      0.00     57.07

The nature of the stalls first made me suspect log flushing activity as the culprit, and this Percona article on log flushing performance issues with MySQL 5.5 describes very similar symptoms. However, further occurrences have shown that INSERT operations into the single MyISAM table in this database are affected by the stall as well, so this does not seem to be an InnoDB-only issue.

Nonetheless, I decided to track the values of Log sequence number and Pages flushed up to from the "LOG" section output of SHOW ENGINE INNODB STATUS every 10 seconds. It does indeed look like flushing activity is ongoing during the stall, as the spread between the two values is decreasing:

Mon Sep 1 14:17:08 CEST 2014 LSN: 263992263703, Pages flushed: 263973405075, Difference: 18416 K
Mon Sep 1 14:17:19 CEST 2014 LSN: 263992826715, Pages flushed: 263973811282, Difference: 18569 K
Mon Sep 1 14:17:29 CEST 2014 LSN: 263993160647, Pages flushed: 263974544320, Difference: 18180 K
Mon Sep 1 14:17:39 CEST 2014 LSN: 263993539171, Pages flushed: 263974784191, Difference: 18315 K
Mon Sep 1 14:17:49 CEST 2014 LSN: 263993785507, Pages flushed: 263975990474, Difference: 17377 K
Mon Sep 1 14:17:59 CEST 2014 LSN: 263994298172, Pages flushed: 263976855227, Difference: 17034 K
Mon Sep 1 14:18:09 CEST 2014 LSN: 263994670794, Pages flushed: 263978062309, Difference: 16219 K
Mon Sep 1 14:18:19 CEST 2014 LSN: 263995014722, Pages flushed: 263983319652, Difference: 11420 K
Mon Sep 1 14:18:30 CEST 2014 LSN: 263995404674, Pages flushed: 263986138726, Difference: 9048 K
Mon Sep 1 14:18:40 CEST 2014 LSN: 263995718244, Pages flushed: 263988558036, Difference: 6992 K
Mon Sep 1 14:18:50 CEST 2014 LSN: 263996129424, Pages flushed: 263988808179, Difference: 7149 K
Mon Sep 1 14:19:00 CEST 2014 LSN: 263996517064, Pages flushed: 263992009344, Difference: 4402 K
Mon Sep 1 14:19:11 CEST 2014 LSN: 263996979188, Pages flushed: 263993364509, Difference: 3529 K
Mon Sep 1 14:19:21 CEST 2014 LSN: 263998880477, Pages flushed: 263993558842, Difference: 5196 K
Mon Sep 1 14:19:31 CEST 2014 LSN: 264001013381, Pages flushed: 263993568285, Difference: 7270 K
Mon Sep 1 14:19:41 CEST 2014 LSN: 264001933489, Pages flushed: 263993578961, Difference: 8158 K
Mon Sep 1 14:19:51 CEST 2014 LSN: 264004225438, Pages flushed: 263993585459, Difference: 10390 K
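(The numbers above were collected with a trivial polling loop along these lines - a sketch rather than the exact script; it just pulls the two relevant lines out of SHOW ENGINE INNODB STATUS every 10 seconds:)

#!/bin/bash
# sample the InnoDB log sequence number and flush pointer every 10 seconds
while true; do
    STATUS=$(mysql -e "SHOW ENGINE INNODB STATUS\G")
    LSN=$(echo "$STATUS" | awk '/Log sequence number/ {print $4}')
    FLUSHED=$(echo "$STATUS" | awk '/Pages flushed up to/ {print $5}')
    echo "$(date) LSN: $LSN, Pages flushed: $FLUSHED, Difference: $(( (LSN - FLUSHED) / 1024 )) K"
    sleep 10
done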

At 14:19:11 the spread reaches its minimum, so flushing activity seems to cease right there, coinciding with the end of the stall. Still, the following points made me dismiss InnoDB log flushing as the cause:

  • for the flushing operation to block all updates to the database it needs to be "synchronous", which means that 7/8 of the log space has to be occupied - with innodb_log_file_size = 256M and innodb_log_files_in_group = 2 (see the configuration below) that would be roughly 448 MB of redo log, whereas the spread tracked above tops out at around 18 MB
  • it would be preceded by an "asynchronous" flushing phase starting at the innodb_max_dirty_pages_pct fill level, which I am not seeing (one way to check this is sketched after the chart below)
  • the LSNs keep increasing even during the stall, so log activity does not cease completely
  • MyISAM table INSERTs are affected as well
  • the page_cleaner thread for adaptive flushing seems to do its work and flush the logs without causing DML queries to stop:

[Chart: LSN - PagesFlushed, spread over time]

(numbers are ([Log Sequence Number] - [Pages flushed up to]) / 1024 from SHOW ENGINE INNODB STATUS)
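One way to verify the second point above - that the buffer pool never gets anywhere near the innodb_max_dirty_pages_pct fill level - is to compare the dirty page count to the total number of buffer pool pages during a stall (a sketch):

# dirty pages vs. total buffer pool pages (both counted in pages)
mysql -e "SHOW GLOBAL STATUS WHERE variable_name IN
          ('Innodb_buffer_pool_pages_dirty', 'Innodb_buffer_pool_pages_total')"

With the 28 GB buffer pool configured below and the default 16 KB page size, the innodb_max_dirty_pages_pct = 50 threshold would correspond to roughly 900,000 dirty pages out of about 1.8 million.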

The issue seems somewhat alleviated by setting innodb_adaptive_flushing_lwm=1, forcing the page cleaner to do more work than before.
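The low water mark is a dynamic variable, so the change can be applied (and reverted) at runtime - roughly:

# apply the lower low water mark at runtime (the 5.6 default is 10)
mysql -e "SET GLOBAL innodb_adaptive_flushing_lwm = 1"

# sanity check: the page cleaner should now flush more aggressively
mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_flushed'"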

The error.log has no entries coinciding with the stalls. SHOW ENGINE INNODB STATUS excerpts after roughly 24 hours of operation look like this:

SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 789330
OS WAIT ARRAY INFO: signal count 1424848
Mutex spin waits 269678, rounds 3114657, OS waits 65965
RW-shared spins 941620, rounds 20437223, OS waits 442474
RW-excl spins 451007, rounds 13254440, OS waits 215151
Spin rounds per wait: 11.55 mutex, 21.70 RW-shared, 29.39 RW-excl
------------------------
LATEST DETECTED DEADLOCK
------------------------
2014-09-03 10:33:55 7fe0e2e44700
*** (1) TRANSACTION:
TRANSACTION 148779867, ACTIVE 36 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 4 row lock(s)
MySQL thread id 34610, OS thread handle 0x7fe0e1fc8700, query id 16294030 5.199.138.162 root updating
[...]
--------
FILE I/O
--------
[...]
932635 OS file reads, 2117126 OS file writes, 1193633 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 17.00 writes/s, 1.20 fsyncs/s
--------------
ROW OPERATIONS
--------------
0 queries inside InnoDB, 0 queries in queue
0 read views open inside InnoDB
Main thread process no. 54745, id 140604272338688, state: sleeping
Number of rows inserted 528904, updated 1596758, deleted 99860, read 3325217158
5.40 inserts/s, 10.40 updates/s, 0.00 deletes/s, 122969.21 reads/s

So, yes, the database does have deadlocks, but they are very infrequent (the "latest" one had been handled around 11 hours before these stats were read).

I tried tracking the "SEMAPHORES" section values over time, both during normal operation and during a stall (I wrote a small script that checks the MySQL server's processlist and dumps a couple of diagnostic commands into a log file whenever an obvious stall is detected - a sketch of it follows below the table). As the numbers were taken over different time frames, I have normalized the results to events/second:

                                         normal      stall
                                         (1h avg)    (1m avg)
OS WAIT ARRAY INFO: reservation count      5.74        1.00
OS WAIT ARRAY INFO: signal count          24.43        3.17
Mutex spin waits                           1.32        5.67
Mutex rounds                               8.33       25.85
Mutex OS waits                             0.16        0.43
RW-shared spins                            9.52        0.76
RW-shared rounds                         140.73       13.39
RW-shared OS waits                         2.60        0.27
RW-excl spins                              6.36        1.08
RW-excl rounds                           178.42       16.51
RW-excl OS waits                           2.38        0.20
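The detection script referred to above is essentially the following loop (a sketch rather than the exact script; the threshold, polling interval and log path are arbitrary placeholders):

#!/bin/bash
# crude stall detector: if DML piles up in "Query end", dump diagnostics to a log
while true; do
    STUCK=$(mysql -Ne "SELECT COUNT(*) FROM information_schema.PROCESSLIST
                       WHERE STATE = 'Query end'")
    if [ "$STUCK" -gt 5 ]; then
        {
            date
            mysql -e "SHOW FULL PROCESSLIST"
            mysql -e "SHOW ENGINE INNODB STATUS\G"
        } >> /var/log/mysql-stall-diagnostics.log
    fi
    sleep 5
done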

I am not quite sure what to make of the numbers in the table above. Most of them have dropped by an order of magnitude - probably due to the ceased update operations - but "Mutex spin waits" and the mutex spin rounds have both increased by a factor of roughly 4.

Investigating this further, the list of mutexes (SHOW ENGINE INNODB MUTEX) shows ~480 entries both in normal operation and during a stall. I have enabled innodb_status_output_locks to see if it gives me more detail.
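Both can be inspected at runtime; roughly what I run (a sketch - the cut/uniq part merely counts entries per mutex name):

# extended lock information in SHOW ENGINE INNODB STATUS (available since 5.6.16)
mysql -e "SET GLOBAL innodb_status_output_locks = ON"

# count SHOW ENGINE INNODB MUTEX entries per mutex name
mysql -Ne "SHOW ENGINE INNODB MUTEX" | cut -f2 | sort | uniq -c | sort -rn | head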

Peculiarity

More often than not, I notice in the MySQL slow query log that the oldest stalling query is an INSERT into a large-ish table with a VARCHAR primary key:

CREATE TABLE `files` (
  `id_files` varchar(32) NOT NULL DEFAULT '',
  `filename` varchar(100) NOT NULL DEFAULT '',
  `content` text,
  PRIMARY KEY (`id_files`),
  KEY `filename` (`filename`),
  FULLTEXT KEY `content` (`content`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Since the primary key is the clustered index in an InnoDB table, it might be that a page split is the trigger for the stall. The definition is similar for the sessions table, which is the second major contributor of slow log entries, although that might simply be because it happens to be updated very frequently:

CREATE TABLE `sessions` (
  `id_sessions` varchar(32) NOT NULL DEFAULT '',
  `hostname` varchar(255) DEFAULT NULL,
  [...]
  PRIMARY KEY (`id_sessions`),
  [...]
) ENGINE=InnoDB DEFAULT CHARSET=latin1
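If page splits on these effectively random VARCHAR(32) clustered keys are indeed involved, the index page split/merge counters in information_schema.INNODB_METRICS should spike during a stall. They are disabled by default, so this is roughly how I would watch them (a sketch; the 'index_page%' counters as listed in the 5.6 INNODB_METRICS table, enabled first):

# enable the index page split/merge counters, then sample them across a stall
mysql -e "SET GLOBAL innodb_monitor_enable = 'index_page%'"
mysql -e "SELECT name, count FROM information_schema.INNODB_METRICS
          WHERE name LIKE 'index_page%'"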

Configuration variables

(I've tinkered with most of them without definite success):

mysql> show global variables where variable_name like 'innodb_adaptive_flush%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| innodb_adaptive_flushing     | ON    |
| innodb_adaptive_flushing_lwm | 1     |
+------------------------------+-------+

mysql> show global variables where variable_name like 'innodb_max_dirty_pages_pct%';
+--------------------------------+-------+
| Variable_name                  | Value |
+--------------------------------+-------+
| innodb_max_dirty_pages_pct     | 50    |
| innodb_max_dirty_pages_pct_lwm | 10    |
+--------------------------------+-------+

mysql> show global variables where variable_name like 'innodb_log_%';
+-----------------------------+-----------+
| Variable_name               | Value     |
+-----------------------------+-----------+
| innodb_log_buffer_size      | 8388608   |
| innodb_log_compressed_pages | ON        |
| innodb_log_file_size        | 268435456 |
| innodb_log_files_in_group   | 2         |
| innodb_log_group_home_dir   | ./        |
+-----------------------------+-----------+

mysql> show global variables where variable_name like 'innodb_double%';
+--------------------+-------+
| Variable_name      | Value |
+--------------------+-------+
| innodb_doublewrite | ON    |
+--------------------+-------+

mysql> show global variables where variable_name like 'innodb_buffer_pool%';
+-------------------------------------+----------------+
| Variable_name                       | Value          |
+-------------------------------------+----------------+
| innodb_buffer_pool_dump_at_shutdown | OFF            |
| innodb_buffer_pool_dump_now         | OFF            |
| innodb_buffer_pool_filename         | ib_buffer_pool |
| innodb_buffer_pool_instances        | 8              |
| innodb_buffer_pool_load_abort       | OFF            |
| innodb_buffer_pool_load_at_startup  | OFF            |
| innodb_buffer_pool_load_now         | OFF            |
| innodb_buffer_pool_size             | 29360128000    |
+-------------------------------------+----------------+

mysql> show global variables where variable_name like 'innodb_io_capacity%';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_io_capacity     | 200   |
| innodb_io_capacity_max | 2000  |
+------------------------+-------+

mysql> show global variables where variable_name like 'innodb_lru_scan_depth%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_lru_scan_depth | 1024  |
+-----------------------+-------+

Things already tried

  • disabling the query cache by SET GLOBAL query_cache_size=0
  • increasing innodb_log_buffer_size to 128M
  • playing around with innodb_adaptive_flushing, innodb_max_dirty_pages_pct and the respective _lwm values (they were set to defaults prior to my changes)
  • increasing innodb_io_capacity (2000) and innodb_io_capacity_max (4000)
  • setting innodb_flush_log_at_trx_commit = 2
  • setting /sys/block/sda/queue/scheduler to noop or deadline (as sketched below)
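The scheduler can be switched at runtime along these lines (a sketch; requires root and does not survive a reboot):

# switch the I/O scheduler for sda at runtime
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler    # the active scheduler is shown in brackets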