I would like some advice about aggregating a lot of data using SUM and GROUP BY on two columns.
I have a table with 60M rows:
```
mysql> SELECT COUNT(*) FROM adgroup_ad_diff;
+----------+
| COUNT(*) |
+----------+
| 59746727 |
+----------+
1 row in set (1.34 sec)

mysql> DESCRIBE adgroup_ad_diff;
+--------------------+---------------+------+-----+---------+----------------+
| Field              | Type          | Null | Key | Default | Extra          |
+--------------------+---------------+------+-----+---------+----------------+
| id                 | int           | NO   | PRI | NULL    | auto_increment |
| ad_group_name_id   | int           | YES  | MUL | NULL    |                |
| date_to            | datetime      | NO   | MUL | NULL    |                |
| cost               | decimal(10,2) | NO   |     | NULL    |                |
| ... more columns                                                           |
+--------------------+---------------+------+-----+---------+----------------+
10 rows in set (0.00 sec)

mysql> SHOW INDEX FROM adgroup_ad_diff;
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table           | Non_unique | Key_name             | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| adgroup_ad_diff |          0 | PRIMARY              |            1 | id               | A         |    58985640 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| adgroup_ad_diff |          0 | search_idx           |            1 | date_to          | A         |      116898 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| adgroup_ad_diff |          0 | search_idx           |            2 | ad_group_name_id | A         |    58983692 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| adgroup_ad_diff |          1 | IDX_CDAC9A9E3A2C86F0 |            1 | ad_group_name_id | A         |       63372 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
```

That table contains the sum of costs for each group. There are over 20k groups. Each date_to is aligned to a 15-minute boundary, so new rows are created every 15 minutes, like:
- 2022-01-01 10:00
- 2022-01-01 10:15
- 2022-01-01 10:30 etc.
I run only simple aggregation queries on that table:

```sql
SELECT SQL_NO_CACHE ad_group_name_id, SUM(cost) AS cost
FROM adgroup_ad_diff
WHERE date_to >= '2021-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
  AND ad_group_name_id IN (... list of ids ...)
GROUP BY ad_group_name_id;
```

and

```sql
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff
WHERE date_to >= '2021-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
  AND ad_group_name_id IN (... list of ids ...)
GROUP BY date_to;
```

The list of IDs is the part that varies, from 30 ids up to 15k ids, depending on which group is selected.
Usually the date_to range I use in the WHERE condition is not big, around 1-10 days, so my idea was to create an index on the pair (date_to, ad_group_name_id), so the MySQL engine can easily filter rows by date and then group them by date_to or ad_group_name_id.
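For clarity, the composite index described above (which appears as search_idx in the SHOW INDEX output, with Non_unique = 0, i.e. a unique index) would be created roughly like this; the exact DDL is my reconstruction, not taken from the original schema:

```sql
-- Sketch: composite (date_to, ad_group_name_id) index,
-- unique per the Non_unique = 0 value shown by SHOW INDEX
CREATE UNIQUE INDEX search_idx
    ON adgroup_ad_diff (date_to, ad_group_name_id);
```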
But it does not work: queries are very slow, and a single query takes about 40 seconds. It does not matter what the date_to bounds are set to; whether I take a day, a month, or a year, it is always about 40 seconds.
I wonder what I can do to improve that time, so that, for example, this query:

```sql
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff
WHERE date_to >= '2022-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
GROUP BY date_to;
```

could be executed faster than in 40 seconds. A new index? Partitioning? I would be really grateful for any tips.
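For context on the partitioning option mentioned above, range partitioning by date would look roughly like the sketch below. Note the caveat: MySQL requires every unique key, including the primary key, to contain the partitioning column, so the existing PRIMARY KEY on id alone would have to be changed (e.g. to (id, date_to)) before this would be accepted. This is only an illustration, not a tested recommendation:

```sql
-- Sketch only: requires the PRIMARY KEY to include date_to first
ALTER TABLE adgroup_ad_diff
PARTITION BY RANGE (TO_DAYS(date_to)) (
    PARTITION p2021 VALUES LESS THAN (TO_DAYS('2022-01-01')),
    PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```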
Btw. in all queries I use SQL_NO_CACHE, because without it the results were not repeatable when I was testing response times.
Update: I added USE INDEX(search_idx) to the query and it's much, much faster: results in less than 5 seconds for a 20-day range, and 1.5 seconds for 5 days. I'm not sure why MySQL does not pick this index on its own; I guess because there is another one that could be used without the grouping. But that solves my problem, I think. Thank you.
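For reference, applying the hint to the GROUP BY date_to query looks like this (USE INDEX goes right after the table name in MySQL):

```sql
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff USE INDEX (search_idx)
WHERE date_to >= '2022-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
GROUP BY date_to;
```

FORCE INDEX (search_idx) is the stronger variant if the optimizer still falls back to a table scan.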