I would like some advice about aggregating a lot of data using SUM and GROUP BY on two columns.
I have a table with 60M rows:
```
mysql> SELECT COUNT(*) FROM adgroup_ad_diff;
+----------+
| COUNT(*) |
+----------+
| 59746727 |
+----------+
1 row in set (1.34 sec)

mysql> DESCRIBE adgroup_ad_diff;
+--------------------+---------------+------+-----+---------+----------------+
| Field              | Type          | Null | Key | Default | Extra          |
+--------------------+---------------+------+-----+---------+----------------+
| id                 | int           | NO   | PRI | NULL    | auto_increment |
| ad_group_name_id   | int           | YES  | MUL | NULL    |                |
| date_to            | datetime      | NO   | MUL | NULL    |                |
| cost               | decimal(10,2) | NO   |     | NULL    |                |
| ... more columns                                                           |
+--------------------+---------------+------+-----+---------+----------------+
10 rows in set (0.00 sec)

mysql> SHOW INDEX FROM adgroup_ad_diff;
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table           | Non_unique | Key_name             | Seq_in_index | Column_name      | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| adgroup_ad_diff |          0 | PRIMARY              |            1 | id               | A         |    58985640 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| adgroup_ad_diff |          0 | search_idx           |            1 | date_to          | A         |      116898 |     NULL |   NULL |      | BTREE      |         |               | YES     | NULL       |
| adgroup_ad_diff |          0 | search_idx           |            2 | ad_group_name_id | A         |    58983692 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
| adgroup_ad_diff |          1 | IDX_CDAC9A9E3A2C86F0 |            1 | ad_group_name_id | A         |       63372 |     NULL |   NULL | YES  | BTREE      |         |               | YES     | NULL       |
+-----------------+------------+----------------------+--------------+------------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
```

That table contains the sum of costs for each group. There are over 20k groups. Each date_to is aligned to a 15-minute boundary, so new rows are created every 15 minutes, like:
- 2022-01-01 10:00
- 2022-01-01 10:15
- 2022-01-01 10:30 etc.
I run only simple aggregation queries on that table:

```sql
SELECT SQL_NO_CACHE ad_group_name_id, SUM(cost) AS cost
FROM adgroup_ad_diff
WHERE date_to >= '2021-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
  AND ad_group_name_id IN (... list of ids ...)
GROUP BY ad_group_name_id;
```

and

```sql
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff
WHERE date_to >= '2021-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
  AND ad_group_name_id IN (... list of ids ...)
GROUP BY date_to;
```

The list of IDs is the part that varies, from 30 ids up to 15k ids, depending on which group is selected.
Usually the date_to range I use in the WHERE condition is not big, around 1-10 days, so my idea was to create an index on the pair (date_to, ad_group_name_id), so the MySQL engine can easily filter rows by date and then group them by date_to or ad_group_name_id.
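For clarity, the composite index described above (which appears as search_idx in the SHOW INDEX output, with Non_unique = 0, i.e. a unique index) would be created roughly like this; the exact DDL is my reconstruction, not taken from the original schema:

```sql
-- Sketch: composite (date_to, ad_group_name_id) index,
-- unique per the Non_unique = 0 value shown by SHOW INDEX
CREATE UNIQUE INDEX search_idx
    ON adgroup_ad_diff (date_to, ad_group_name_id);
```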
But it does not work: queries are very slow, and a single query takes about 40 seconds. It does not matter what the date_to bounds are set to; whether I take a day, a month, or a year, it is always about 40 seconds.
I wonder what I can do to improve that time, so that, for example, this query:

```sql
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff
WHERE date_to >= '2022-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
GROUP BY date_to;
```

could be executed faster than in 40 seconds. A new index? Partitioning? I would be really grateful for any tips.
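For context on the partitioning option mentioned above, range partitioning by date would look roughly like the sketch below. Note the caveat: MySQL requires every unique key, including the primary key, to contain the partitioning column, so the existing PRIMARY KEY on id alone would have to be changed (e.g. to (id, date_to)) before this would be accepted. This is only an illustration, not a tested recommendation:

```sql
-- Sketch only: requires the PRIMARY KEY to include date_to first
ALTER TABLE adgroup_ad_diff
PARTITION BY RANGE (TO_DAYS(date_to)) (
    PARTITION p2021 VALUES LESS THAN (TO_DAYS('2022-01-01')),
    PARTITION p2022 VALUES LESS THAN (TO_DAYS('2023-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```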
Btw. in all queries I use SQL_NO_CACHE, because without it the results were not repeatable when I was testing response times.
Update: I added USE INDEX(search_idx) to the query and it's much, much faster: results in less than 5 seconds for a 20-day range, and 1.5 seconds for 5 days. I'm not sure why MySQL does not pick this index on its own; I guess because there is another one that could be used without the grouping. But that solves my problem, I think. Thank you.
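For reference, applying the hint to the GROUP BY date_to query looks like this (USE INDEX goes right after the table name in MySQL):

```sql
SELECT SQL_NO_CACHE date_to, SUM(cost) AS cost
FROM adgroup_ad_diff USE INDEX (search_idx)
WHERE date_to >= '2022-05-20 10:01'
  AND date_to <= '2022-05-30 23:45'
GROUP BY date_to;
```

FORCE INDEX (search_idx) is the stronger variant if the optimizer still falls back to a table scan.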