MySQL Query Optimisation 101
$ whoami ● Federico Razzoli ● Freelance consultant ● Writing SQL since MySQL 3.23 ● hello@federico-razzoli.com ● I love open source, sharing, collaboration, win-win, etc ● I love MariaDB, MySQL, Postgres, etc ○ Even Db2, somehow
Why is the database important?
Remember the Von Neumann machine?
It’s always about Data ● Since then, the purpose of hardware and software never changed: ○ Receive data ○ Process data ○ Output data
A rose by any other name... ● Feel free to use synonyms ○ Validate ○ Sanitise ○ Parse ○ Persist ○ Normalise / Denormalise ○ Cache ○ Map / Reduce ○ Print ○ Ping ○ ...
...would smell as sweet ● The database of known stars is not a ping package ● You use a DBMS to abstract data management as much as possible ○ Persistence ○ Consistency ○ Queries ○ Fast search ○ … ● That’s why “database is magic”
Busy ● But it’s just a very busy person, performing many tasks concurrently ● And each is: ○ Important (must be reliable) ○ Complex (must be fast) ○ Expected (if something goes wrong, you will complain) In practice, the DBMS is usually the bottleneck.
Terminology ● Statement: an SQL command ● Query: a statement that returns a resultset ...or any other statement :) ● Resultset: the output; 0 or more rows ● Optimiser / Query Planner: the component responsible for deciding a query’s execution plan ● Optimised query: a query whose execution plan is reasonably good ...this doesn’t imply in any way that the query is fast ● Database: a set of tables (a schema) ● Instance / Server: a running MySQL daemon (cluster)
Performance
When should a query be optimised?

mysql> EXPLAIN SELECT * FROM t WHERE c < 10\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t
   partitions: NULL
         type: ALL
possible_keys: idx_c
          key: idx_c
      key_len: 4
          ref: NULL
         rows: 213030
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.01 sec)
When should a query be optimised?

mysql> SELECT * FROM performance_schema.events_statements_summary_by_digest
       WHERE DIGEST = '254a65744e661e072103b7a7630dee1c3a3b8e906f19889f7c796aebe7cdd4f8'\G
*************************** 1. row ***************************
                 SCHEMA_NAME: test
                      DIGEST: 254a65744e661e072103b7a7630dee1c3a3b8e906f19889f7c796aebe7cdd4f8
                 DIGEST_TEXT: SELECT * FROM `t` WHERE `c` < ?
                  COUNT_STAR: 1
                         ...
           SUM_ROWS_AFFECTED: 0
               SUM_ROWS_SENT: 57344
           SUM_ROWS_EXAMINED: 212992
 SUM_CREATED_TMP_DISK_TABLES: 0
      SUM_CREATED_TMP_TABLES: 0
        SUM_SELECT_FULL_JOIN: 0
  SUM_SELECT_FULL_RANGE_JOIN: 0
            SUM_SELECT_RANGE: 0
      SUM_SELECT_RANGE_CHECK: 0
             SUM_SELECT_SCAN: 1
       SUM_SORT_MERGE_PASSES: 0
              SUM_SORT_RANGE: 0
               SUM_SORT_ROWS: 0
               SUM_SORT_SCAN: 0
           SUM_NO_INDEX_USED: 1
      SUM_NO_GOOD_INDEX_USED: 0
                  FIRST_SEEN: 2019-05-14 00:31:24.078967
                   LAST_SEEN: 2019-05-14 00:31:24.078967
                         ...
           QUERY_SAMPLE_TEXT: SELECT * FROM t WHERE c < 10
           QUERY_SAMPLE_SEEN: 2019-05-14 00:31:24.078967
     QUERY_SAMPLE_TIMER_WAIT: 117493874000
But how do I find impacting queries? ● It depends what you mean by “impacting” ● There are several monitoring methods (USE, etc) ● But 3 philosophies: ○ Panicking when you hear that something is down or slow ○ System-centric monitoring ○ User-centric
But how do I find impacting queries? ● Panicking when you hear that something is down or slow ● System-centric monitoring ● User-centric You can use them all.
Panicking ● Simplest method ● Do nothing to prevent anything ○ Optionally, take a lot of actions to prevent imaginary problems in imaginary ways ○ There is no evidence that your job is useless, so your boss will not fire you
System-centric ● pt-query-digest, PMM, etc ● Merge queries into one, normalising their text and replacing parameters ○ SELECT * FROM t WHERE b= 111 AND a = 0 -- comment ○ Select * From t Where a = 24 and b=42; ○ SELECT * FROM t WHERE a = ? AND b = ? ● Sum the execution time of each occurrence (Grand Total Time) ● Optimise the queries with the highest GTT
User-Centric ● Calculate the cost of slowness (users don’t buy, maybe leave forever) ● The cost of slowness differs by ○ URL ○ Number of users ○ ...other variables that depend on your business ■ (day of month, country, etc) ● Set Service Level Objectives ● Monitor HTTP call latency, and the involved services ● Find out what’s slowing them down
Query Performance
What makes a query “important”? ● How many times it’s executed ● Its locking
What makes a query slow? ● Number of rows read ○ Read != return ○ Indexes are there to lower the number of reads ● Number of rows written ○ In-memory temp tables are not good ○ On-disk temp tables are worse
How do I optimise a query? ● Use indexes properly ● Avoid creation of temp tables, if possible
What is an index?
Index Types ● BTREE - ordered data structure ● HASH - hash table ● PostgreSQL has many more ● Each storage engine can implement either or both ● InnoDB uses BTREE, and internally uses HASH when it thinks it’s better ● The syntax CREATE INDEX ... USING [BTREE | HASH] is generally useless We will focus on BTREE indexes in InnoDB
Index Properties ● Primary key: unique values, not null ● UNIQUE ● Multiple columns ○ The order matters ● Column prefix (only for strings)
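The properties above can be sketched in DDL; the table, column, and index names here are hypothetical, not from the slides:

```sql
CREATE TABLE user (
  id INTEGER UNSIGNED AUTO_INCREMENT,
  email VARCHAR(255) NOT NULL,
  last_name VARCHAR(100),
  first_name VARCHAR(100),
  bio TEXT,
  PRIMARY KEY (id),                        -- unique values, not null
  UNIQUE INDEX uniq_email (email),         -- UNIQUE: duplicates rejected
  INDEX idx_name (last_name, first_name),  -- multiple columns: the order matters
  INDEX idx_bio (bio(50))                  -- column prefix: only the first 50 characters are indexed
);
```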
InnoDB Indexes ● InnoDB tables should always have a Primary Key ● The table is stored ordered by primary key ● The table itself is the primary key ○ Columns “not part of the primary key” simply don’t affect the order of rows
InnoDB Indexes Table columns: {a, b, c} Primary key: {a, b}

A  B   C
1  1   4
1  2   1
2  1   9
2  2   3
3  0   3
4  20  0
InnoDB Indexes ● Secondary indexes are stored separately ● They are ordered by the indexed columns ● Each entry contains a reference to a primary key entry
InnoDB Indexes Primary key: {a, b} Index idx_c: {c}

c  a  b
0  4  20
1  1  2
3  2  2
3  3  0
4  1  1
9  2  1
Which queries will be faster? Table columns: {a, b, c, d, e} Primary key: {a, b} Index idx_c: {c} ● SELECT * FROM t WHERE a = 1 ● SELECT * FROM t WHERE a = 1 AND b = 2 ● SELECT a, b FROM t WHERE c = 0 ● SELECT d, e FROM t WHERE c = 0
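One way to check your answers is to build a hypothetical table matching the slide’s layout and look at EXPLAIN; the comments state the expected InnoDB behaviour:

```sql
-- Hypothetical table matching the slide: PK {a, b}, secondary index {c}
CREATE TABLE t (
  a INT NOT NULL, b INT NOT NULL,
  c INT, d INT, e INT,
  PRIMARY KEY (a, b),
  INDEX idx_c (c)
);

EXPLAIN SELECT * FROM t WHERE a = 1;            -- leftmost part of the PK: fast
EXPLAIN SELECT * FROM t WHERE a = 1 AND b = 2;  -- whole PK: point lookup, fastest
EXPLAIN SELECT a, b FROM t WHERE c = 0;         -- idx_c is covering: its entries already carry {a, b}
EXPLAIN SELECT d, e FROM t WHERE c = 0;         -- idx_c lookup, then a second lookup into the PK for d, e
```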
More performance considerations?
More performance considerations? ● Big primary key = big indexes ● The primary key should be append-only, e.g. INTEGER UNSIGNED AUTO_INCREMENT ● These indexes are duplicates: {a} is redundant if {a, b} exists ● This index is wrong: {a, id} - InnoDB already appends the primary key to every secondary index
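A sketch of the redundant and “wrong” indexes above, with hypothetical index names:

```sql
-- {a} is redundant: {a, b} already serves any query that {a} serves
CREATE INDEX idx_a   ON t (a);
CREATE INDEX idx_a_b ON t (a, b);

-- {a, id} is wrong in InnoDB: the primary key (id) is already
-- appended to every secondary index entry, so listing it is wasted space
CREATE INDEX idx_a_id ON t (a, id);
```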
Index implementation
https://github.com/jeremycole/innodb_diagrams
More performance considerations?
More performance considerations? ● Writing to an index is relatively slow ● Deleting many rows leaves fragmented indexes
WHERE is the index?
Phone Book ● Indexes are ordered data structures ● Think of them as a phone book Table: {first_name, last_name, phone, address} Index: {last_name, first_name}
Phone Book ● I will show you some queries, and you will tell me which can be solved by using the index ● You may not know, but your mind contains a pretty good SQL optimiser Table: {first_name, last_name, phone, address} Index: {last_name, first_name}
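As DDL, the phone book could look like this (a sketch; the column and index layout is from the slide, the types are assumptions):

```sql
CREATE TABLE phone_book (
  first_name VARCHAR(100),
  last_name  VARCHAR(100),
  phone      VARCHAR(30),
  address    VARCHAR(255),
  INDEX idx_name (last_name, first_name)  -- the "phone book order"
);
```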
Queries SELECT * FROM phone_book … WHERE last_name = 'Baker' WHERE last_name IN ('Hartnell','Baker', 'Whittaker') WHERE last_name > 'Baker' WHERE last_name >= 'Baker' WHERE last_name < 'Baker' WHERE last_name <= 'Baker' WHERE last_name <> 'Baker'
Queries SELECT * FROM phone_book … WHERE last_name IS NULL WHERE last_name IS NOT NULL
Rule #1 A BTREE can optimise point searches and ranges
Queries WHERE last_name >= 'B' AND last_name < 'C' WHERE last_name BETWEEN 'B' AND 'C' WHERE last_name LIKE 'B%'
Queries WHERE last_name LIKE 'B%' WHERE last_name LIKE '%B%' WHERE last_name LIKE '%B' WHERE last_name LIKE 'B_' WHERE last_name LIKE '_B_' WHERE last_name LIKE '_B'
Rule #2 A LIKE condition whose second operand starts with a 'constant string' is a range
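This is why only the leading-constant form can use the index; a sketch against the phone_book table:

```sql
-- Roughly equivalent for the optimiser: both are ranges on last_name
SELECT * FROM phone_book WHERE last_name LIKE 'B%';
SELECT * FROM phone_book WHERE last_name >= 'B' AND last_name < 'C';

-- A leading wildcard cannot be turned into a range, so no index is usable
SELECT * FROM phone_book WHERE last_name LIKE '%B';
```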
Queries WHERE first_name = 'Tom' WHERE last_name = 'Baker' WHERE first_name = 'Tom' AND last_name = 'Baker' WHERE last_name = 'Baker' AND first_name = 'Tom'
Rule #3 We can use a whole index or its leftmost part
Queries WHERE LEFT(last_name, 2) = 'Ba' WHERE last_name = CONCAT('Ba', 'ker')
Rule #4 The optimiser cannot make assumptions about the results of functions/expressions. However, wrapping a constant value in a function produces another constant value, which is mostly irrelevant for query optimisation.
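The practical consequence: keep the indexed column bare, and leave functions on the constant side. A sketch:

```sql
-- Not optimisable: the optimiser can't predict LEFT()'s result for each row
SELECT * FROM phone_book WHERE LEFT(last_name, 2) = 'Ba';
-- Optimisable rewrite: the column is bare and the condition is a range
SELECT * FROM phone_book WHERE last_name LIKE 'Ba%';

-- Fine as-is: CONCAT('Ba', 'ker') is folded into the constant 'Baker'
SELECT * FROM phone_book WHERE last_name = CONCAT('Ba', 'ker');
```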
Queries WHERE last_name = first_name
Rule #5 Comparing a column with another column results in a comparison whose operands change at every row. The optimiser cannot filter out any rows in advance.
Queries WHERE last_name = 'Baker' AND phone = '+44 7739 427279'
Rule #6 We can use an index to restrict the search to a set of rows, and then search those rows in a non-optimised fashion. Depending on that set’s size, this can be a brilliant or a terrible strategy
Queries WHERE last_name = 'Baker' AND first_name > 'Tom' WHERE last_name > 'Baker' AND first_name = 'Tom' WHERE last_name > 'Baker' AND first_name > 'Tom'
Queries WHERE last_name = 'Baker' AND first_name > 'Tom' WHERE first_name = 'Tom' AND last_name > 'Baker' WHERE first_name > 'Tom' AND last_name = 'Baker' WHERE last_name > 'Baker' AND first_name > 'Tom' Baker, Colin Baker, Tom Baker, Walter Capaldi, Ada Capaldi, Peter Whittaker, Jody Whittaker, Vadim
Rule #7 If we have a range condition on an index column The next index columns cannot be used If you prefer: Index usage stops at the first >
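A sketch of Rule #7 against the {last_name, first_name} index:

```sql
-- Equality on the first column: both index columns are usable
SELECT * FROM phone_book WHERE last_name = 'Baker' AND first_name > 'Tom';

-- Range on the first column: index usage stops there;
-- first_name = 'Tom' is checked row by row on the fetched rows
SELECT * FROM phone_book WHERE last_name > 'Baker' AND first_name = 'Tom';
```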
ORDER BY, GROUP BY
Mr Speaker talks to MySQL
Queries ORDER BY last_name ORDER BY first_name ORDER BY last_name, first_name ORDER BY first_name, last_name
Queries GROUP BY last_name GROUP BY first_name GROUP BY last_name, first_name GROUP BY first_name, last_name
Rule #8 ORDER BY and GROUP BY can take advantage of an index order or create an internal temp table Note: GROUP BY optimisation also depends on the function we’re using (MAX, COUNT…).
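EXPLAIN shows which variant can sort via the index and which needs extra work; a sketch against the phone_book table and its {last_name, first_name} index:

```sql
-- Matches the index order: rows can be read already sorted
EXPLAIN SELECT * FROM phone_book ORDER BY last_name, first_name;

-- Matches no index order: expect "Using filesort" in the Extra column
EXPLAIN SELECT * FROM phone_book ORDER BY first_name, last_name;
```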
Queries WHERE last_name > 'Baker' ORDER BY last_name WHERE last_name = 'Baker' ORDER BY first_name WHERE last_name > 'Baker' ORDER BY first_name
Rule #9 If we have an ORDER BY / GROUP BY on an index column The next index columns cannot be used
Multiple Indexes
Queries Table: {id, a, b, c, d} idx_a: {a, d} idx_b: {b} WHERE a = 10 OR a = 20 WHERE a = 24 OR c = 42 WHERE a = 24 OR d = 42 WHERE a = 24 AND b = 42 WHERE a = 24 OR b = 42 WHERE a = 24 ORDER BY b GROUP BY a ORDER BY b
Rule #10 Using multiple indexes for AND (intersection) or OR (union) is possible, but there is a benefit only if we read MANY rows Using one index for WHERE and a different one for GROUP BY / ORDER BY is not possible
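MySQL calls the multiple-index strategy index_merge, and whether it is chosen shows up in EXPLAIN. A sketch with the slide’s hypothetical indexes (idx_a: {a, d}, idx_b: {b}):

```sql
-- OR across two indexed columns: index_merge (union) is possible
EXPLAIN SELECT * FROM t WHERE a = 24 OR b = 42;

-- OR with a non-indexed column (c): no index can help; full scan
EXPLAIN SELECT * FROM t WHERE a = 24 OR c = 42;

-- WHERE on one index, ORDER BY on another: MySQL uses at most one, then sorts
EXPLAIN SELECT * FROM t WHERE a = 24 ORDER BY b;
```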
Thank you kindly! https://federico-razzoli.com info@federico-razzoli.com