Improve Query Performance over Large Dataset

Question

As part of a daily cron job, I need to run a query that processes a large amount of data. The data relates to visitors coming to a website, and the job updates what we have captured previously with the new visits.

The query relies on 2 derived tables (SELECT queries in the FROM clause) to do its work:

    SELECT new_visits.visitor_id,
           new_visits.visit_id,
           new_visits.visit_first_action_time,
           new_visits.purchased AS purchased,
           IFNULL(existing_visitors.purchased, 0) AS existing_purchased
    FROM (
        SELECT tv.visitor_id,
               tv.visit_id,
               tv.visit_first_action_time,
               IF(tc.idgoal = 0, 1, 0) AS purchased
        FROM tbl_visit tv
        LEFT OUTER JOIN tbl_conversion tc
            ON tv.visit_id = tc.visit_id AND tc.idgoal = 0
        WHERE tv.idsite = 12
          AND tv.visit_id >= 477256
        ORDER BY tv.visit_id
        LIMIT 1000
    ) new_visits
    LEFT JOIN (
        SELECT visitor_id, MAX(visit_seq) AS visit_seq, purchased
        FROM tbl_last_input_visit
        WHERE site_id = 12
        GROUP BY visitor_id
    ) existing_visitors
        ON new_visits.visitor_id = existing_visitors.visitor_id
    ORDER BY new_visits.visitor_id, new_visits.visit_id;

With smaller datasets, this query works just fine. However, as the data grows, the query becomes progressively slower, to the point where it takes around 30 seconds to execute (at the start it took around 1.5 seconds).

The query plan is as follows:

    +----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
    | id | select_type | table                  | type  | possible_keys                                                                     | key           | key_len | ref               | rows    | Extra                           |
    +----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+
    |  1 | PRIMARY     | <derived2>             | ALL   | NULL                                                                              | NULL          | NULL    | NULL              | 1000    | Using temporary; Using filesort |
    |  1 | PRIMARY     | <derived3>             | ALL   | NULL                                                                              | NULL          | NULL    | NULL              | 705325  |                                 |
    |  3 | DERIVED     | tbl_input_visit        | ref   | visitorid_seq,visitorid_idx                                                       | idvisitor_seq | 4       |                   | 490047  | Using where                     |
    |  2 | DERIVED     | tv                     | range | PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | PRIMARY       | 4       | NULL              | 4781309 | Using where                     |
    |  2 | DERIVED     | tc                     | ref   | PRIMARY                                                                           | PRIMARY       | 8       | tv.idvisit        | 1       | Using index                     |
    +----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+

At this point, one option I have explored is creation of temporary tables. However, the overhead of doing so is quite significant. I also realise that since this query relies on derived tables, MySQL will not be able to reuse any underlying indexes.

Here are the create statements for the tables involved:

    CREATE TABLE `tbl_last_input_visit` (
      `site_id` int(10) unsigned NOT NULL,
      `visitor_id` binary(8) NOT NULL,
      `visit_seq` int(10) unsigned NOT NULL,
      `purchase_cycle_seq` int(10) unsigned NOT NULL,
      `visit_in_cycle_seq` int(10) unsigned NOT NULL,
      `purchased` smallint(5) unsigned NOT NULL COMMENT 'l_ij',
      UNIQUE KEY `idvisitor_seq` (`site_id`,`idvisitor`,`visit_seq`),
      KEY `idvisitor_idx` (`site_id`,`idvisitor`)
    ) ENGINE=InnoDB;

    CREATE TABLE `tbl_log_visit` (
      `visit_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `idsite` int(10) unsigned NOT NULL,
      `idvisitor` binary(8) NOT NULL,
      `visit_last_action_time` DATETIME,
      `config_id` int(10) unsigned NOT NULL,
      PRIMARY KEY (`visit_id`),
      KEY `index_idsite_config_datetime` (`site_id`,`config_id`,`visit_last_action_time`),
      KEY `index_idsite_datetime` (`site_id`,`visit_last_action_time`),
      KEY `index_idsite_idvisitor` (`site_id`,`visitor_id`)
    ) ENGINE=InnoDB;

    CREATE TABLE `tbl_log_conversion` (
      `visit_id` int(10) unsigned NOT NULL,
      `site_id` int(10) unsigned NOT NULL,
      `visitor_id` binary(8) NOT NULL,
      `idgoal` int(10) NOT NULL,
      `idorder` int(10) NOT NULL,
      PRIMARY KEY (`visit_id`,`idgoal`),
      UNIQUE KEY `unique_idsite_idorder` (`site_id`,`idorder`)
    ) ENGINE=InnoDB;

Is there some way I can go about improving the performance of this query?

Actually, the value of purchased is used by the application layer to determine the next value for the purchase counter (which is maintained at that level). The purchased column value returned by the first derived query will only be a 1 or 0. It's simply a flag which is used to update the purchase counter retrieved from the second derived query.
– anirvan, Commented Oct 7, 2013 at 18:23

Dmitriy Mozgovoy · Accepted Answer · 2013-10-18 21:36:08Z

First of all, a good way to help us help you is to show valid CREATE statements. As posted, they fail:

    tbl_last_input_visit: #1072 - Key column 'idvisitor' doesn't exist in table
    tbl_log_visit: #1072 - Key column 'site_id' doesn't exist in table

Second: I will try to find out how to optimize it, but check the derived queries first, as they can be the reason for the slow processing.

Third: the query and the table definitions are inconsistent. There is no column visit_first_action_time in table tbl_log_visit (which the query refers to without the _log part, as tbl_visit). Is the field "visit_first_action_time" the same type as "visit_last_action_time"?

anirvan · Accepted Answer · 2013-10-22 05:58:33Z

So the root problem, as @Dmitriy mentioned, had to do with the derived queries. Basically, when operating on very large data sets, derived tables can cause a whole lot of pain, because the materialised result of a derived query does not carry the indexes of the tables it was built from. That matches the plan above, where both derived tables are read with type ALL.

In short, if you're writing a SELECT over a derived query over tblA and tblB, then the indexes of tblA and tblB are not available when the outer query filters or joins on the derived result. So, if the dataset returned from tblA and tblB is huge, the resulting query will be very slow.

I ended up fixing the problem by breaking the derived queries apart into separate queries and matching the results in the application layer. I also got a sizeable boost in performance by setting up indexes on the columns that contribute to the GROUP BY clause in one of the queries. (A big thanks to @strawberry for that!)
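A minimal sketch of that approach, assuming a Python application layer. SQLite (from the standard library) stands in for MySQL so the example is runnable, the tables are trimmed versions of the ones in the question, and the function names, index name and sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for MySQL; trimmed, hypothetical versions of the tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_visit (
        visit_id   INTEGER PRIMARY KEY,
        idsite     INTEGER NOT NULL,
        visitor_id TEXT    NOT NULL
    );
    CREATE TABLE tbl_conversion (
        visit_id INTEGER NOT NULL,
        idgoal   INTEGER NOT NULL,
        PRIMARY KEY (visit_id, idgoal)
    );
    CREATE TABLE tbl_last_input_visit (
        site_id    INTEGER NOT NULL,
        visitor_id TEXT    NOT NULL,
        visit_seq  INTEGER NOT NULL,
        purchased  INTEGER NOT NULL
    );
    -- Index on the filter and grouping columns of the second query.
    CREATE INDEX idx_site_visitor_seq
        ON tbl_last_input_visit (site_id, visitor_id, visit_seq);

    INSERT INTO tbl_visit VALUES (1, 12, 'a'), (2, 12, 'b'), (3, 12, 'a');
    INSERT INTO tbl_conversion VALUES (2, 0);
    INSERT INTO tbl_last_input_visit VALUES (12, 'a', 1, 0), (12, 'a', 2, 3);
""")


def fetch_new_visits(conn, site_id, from_visit_id, limit=1000):
    """First query: the next batch of visits with a 0/1 purchased flag."""
    return conn.execute("""
        SELECT tv.visitor_id, tv.visit_id,
               CASE WHEN tc.visit_id IS NULL THEN 0 ELSE 1 END AS purchased
        FROM tbl_visit tv
        LEFT JOIN tbl_conversion tc
               ON tc.visit_id = tv.visit_id AND tc.idgoal = 0
        WHERE tv.idsite = ? AND tv.visit_id >= ?
        ORDER BY tv.visit_id
        LIMIT ?
    """, (site_id, from_visit_id, limit)).fetchall()


def fetch_existing_purchased(conn, site_id, visitor_ids):
    """Second query: stored purchase counter, only for visitors in the batch."""
    if not visitor_ids:
        return {}
    placeholders = ",".join("?" * len(visitor_ids))
    rows = conn.execute(f"""
        SELECT visitor_id, visit_seq, purchased
        FROM tbl_last_input_visit
        WHERE site_id = ? AND visitor_id IN ({placeholders})
        ORDER BY visitor_id, visit_seq
    """, (site_id, *visitor_ids))
    # Rows arrive in visit_seq order, so the highest visit_seq wins per visitor.
    return {visitor_id: purchased for visitor_id, _seq, purchased in rows}


def merge_visits(conn, site_id, from_visit_id):
    """Match the two result sets in the application instead of in SQL."""
    new_visits = fetch_new_visits(conn, site_id, from_visit_id)
    visitor_ids = sorted({row[0] for row in new_visits})
    existing = fetch_existing_purchased(conn, site_id, visitor_ids)
    merged = [
        (visitor_id, visit_id, purchased, existing.get(visitor_id, 0))
        for visitor_id, visit_id, purchased in new_visits
    ]
    return sorted(merged)  # by visitor_id, then visit_id


rows = merge_visits(conn, site_id=12, from_visit_id=1)
```

Restricting the second query to the visitor IDs of the current batch is a design choice of this sketch rather than something stated in the answer; for large batches the IN list may need to be sent in chunks.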

Allan S. Hansen · Accepted Answer · 2013-10-08 07:47:18Z

Personally, I'd create actual, real tables with indexes and so on, and then just truncate them prior to running your job.

Temporary tables and similar are nice enough for small things, but for complex things they quickly start to become increasingly slow due to the difficulty for the database engines to optimize the plans plus the memory usage of temporary values.
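As a rough sketch of that suggestion, the example below keeps a permanent, indexed staging table and refills it before each run. SQLite stands in for MySQL so that it runs as-is (its `DELETE FROM` plays the role of MySQL's `TRUNCATE TABLE`); the staging table name, the function names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_visit (
        visit_id   INTEGER PRIMARY KEY,
        idsite     INTEGER NOT NULL,
        visitor_id TEXT    NOT NULL
    );
    CREATE TABLE tbl_last_input_visit (
        site_id    INTEGER NOT NULL,
        visitor_id TEXT    NOT NULL,
        visit_seq  INTEGER NOT NULL,
        purchased  INTEGER NOT NULL,
        UNIQUE (site_id, visitor_id, visit_seq)
    );
    -- Permanent staging table (hypothetical name); its primary key is the
    -- index that the later join can use.
    CREATE TABLE stage_existing_visitors (
        visitor_id TEXT    PRIMARY KEY,
        purchased  INTEGER NOT NULL
    );

    INSERT INTO tbl_visit VALUES (1, 12, 'a'), (2, 12, 'b');
    INSERT INTO tbl_last_input_visit VALUES (12, 'a', 1, 0), (12, 'a', 2, 3);
""")


def refresh_stage(conn, site_id):
    """Empty the staging table and reload the latest row per visitor."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM stage_existing_visitors")
        conn.execute("""
            INSERT INTO stage_existing_visitors (visitor_id, purchased)
            SELECT liv.visitor_id, liv.purchased
            FROM tbl_last_input_visit liv
            WHERE liv.site_id = ?
              AND liv.visit_seq = (
                  SELECT MAX(visit_seq)
                  FROM tbl_last_input_visit
                  WHERE site_id = liv.site_id
                    AND visitor_id = liv.visitor_id
              )
        """, (site_id,))


def visits_with_existing(conn, site_id):
    """Join visits against the indexed staging table, not a derived table."""
    return conn.execute("""
        SELECT tv.visitor_id, tv.visit_id, COALESCE(s.purchased, 0)
        FROM tbl_visit tv
        LEFT JOIN stage_existing_visitors s
               ON s.visitor_id = tv.visitor_id
        WHERE tv.idsite = ?
        ORDER BY tv.visitor_id, tv.visit_id
    """, (site_id,)).fetchall()


refresh_stage(conn, 12)
refresh_stage(conn, 12)  # repeating the refresh leaves the same contents
report = visits_with_existing(conn, 12)
```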

Izalias · Accepted Answer · 2020-07-18 00:34:27Z

I can see a really concerted effort to outsmart the query optimiser, which is a quick way to make a query run like a pig, because you unwittingly goofed by nesting the worst possible operation within another operation.

ORDER BY.

A non-linearly scaling sort. The best you can hope for with ANY comparison sort is O(N log N) processing time, and you chuck it into the sub-table so cavalierly that you don't think twice about the impact it may have on a much larger dataset.

That's not the worst of it either, as you sort it on TWO LEVELS!

You have a group by, featuring a max sequence that does nothing as far as I can tell... the max value is not printed in the output and isn't invoked as a logical operation, SO WHY IS IT EVEN THERE??? It needs a grouping that is doing nothing but contributing to the performance issue.

Make sure your source dataset is indexed properly so you don't need to invoke the ORDER BY, and then simplify your query so the query optimiser isn't spitting out nested loops and sorts.

    SELECT tv.visitor_id,
           tv.visit_id,
           tv.visit_first_action_time,
           -- MySQL syntax: IFNULL takes two arguments, and there is no CAST(... AS BIT)
           CASE IFNULL(tc.idgoal, 1) WHEN 0 THEN 1 ELSE 0 END AS purchased,
           liv.purchased AS existing_purchased
    FROM tbl_visit tv
    LEFT JOIN tbl_conversion tc
        ON tv.visit_id = tc.visit_id AND tc.idgoal = 0
    LEFT JOIN tbl_last_input_visit liv
        ON liv.site_id = 12 AND tv.visitor_id = liv.visitor_id
    WHERE tv.idsite = 12
      AND tv.visit_id >= 477256;

When it comes to making things perform, you need to be really asking yourself the following questions.

Does this datatype need to be this big? (It means more storage down the pipeline)
Does this query need sorting? (It won't scale well if it does, invoke an index in advance.)
Do I use this output later on? (If not, get rid of it!)
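To illustrate the second question, here is a small sketch of how an index can remove a sort step. It uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in, since MySQL reports the same situation differently (as "Using filesort" in the Extra column of EXPLAIN); the index name and the helper function are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl_visit (
        visit_id   INTEGER PRIMARY KEY,
        idsite     INTEGER NOT NULL,
        visitor_id TEXT    NOT NULL
    )
""")

QUERY = """
    SELECT visitor_id, visit_id
    FROM tbl_visit
    WHERE idsite = 12
    ORDER BY visitor_id, visit_id
"""


def needs_sort(conn):
    """Return True if SQLite plans an explicit sort step for QUERY."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return any("TEMP B-TREE" in row[-1] for row in plan)


sort_before = needs_sort(conn)

# Index whose column order matches the filter first, then the ORDER BY.
conn.execute("""
    CREATE INDEX idx_site_visitor_visit
        ON tbl_visit (idsite, visitor_id, visit_id)
""")

sort_after = needs_sort(conn)
```

With the index in place, rows can be read in the requested order straight from the index, so `sort_after` should come out false where `sort_before` was true.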
