As part of a daily cron job, I need to run a query that processes a whole lot of data. This data is related to the visitors coming to a website, and updating the data with what we have captured previously.
The query relies on 2 derived tables (select queries in the FROM section), to do its work —
SELECT new_visits.visitor_id, new_visits.visit_id, new_visits.visit_first_action_time, new_visits.purchased as purchased, ifnull(existing_visitors.purchased, 0) as existing_purchased FROM ( SELECT tv.visitor_id, tv.visit_id, tv.visit_first_action_time, if(tc.idgoal=0,1,0) as purchased FROM tbl_visit tv left outer join tbl_conversion tc ON tv.visit_id = tc.visit_id AND tc.idgoal = 0 WHERE tv.idsite= 12 AND tv.visit_id >= 477256 ORDER BY tv.visit_id LIMIT 1000 ) new_visits LEFT JOIN ( SELECT visitor_id, max(visit_seq) as visit_seq, purchased FROM tbl_last_input_visit where site_id = 12 GROUP BY visitor_id ) existing_visitors ON new_visits.visitor_id = existing_visitors.visitor_id ORDER BY new_visits.visitor_id, new_visits.visit_id; With smaller datasets, this query works just fine. However, as the data increases, the slowly becomes progressively slower. Until a point where it starts to take around 30 seconds to executed (at the start it takes around 1.5 seconds).
The query plan is as follows —
+----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+ | 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000 | Using temporary; Using filesort | | 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 705325 | | | 3 | DERIVED | tbl_input_visit | ref | visitorid_seq,visitorid_idx | idvisitor_seq | 4 | | 490047 | Using where | | 2 | DERIVED | tv | range | PRIMARY,index_idsite_config_datetime,index_idsite_datetime,index_idsite_idvisitor | PRIMARY | 4 | NULL | 4781309 | Using where | | 2 | DERIVED | tc | ref | PRIMARY | PRIMARY | 8 | tv.idvisit | 1 | Using index | +----+-------------+------------------------+-------+-----------------------------------------------------------------------------------+---------------+---------+-------------------+---------+---------------------------------+ At this point, one option I have explored is creation of temporary tables. However, the overhead of doing so is quite significant. I also realise that since this query relies on derived tables, MySQL will not be able to reuse any underlying indexes.
Here are the create statements for the tables involved —
CREATE TABLE `tbl_last_input_visit` ( `site_id` int(10) unsigned NOT NULL, `visitor_id` binary(8) NOT NULL, `visit_seq` int(10) unsigned NOT NULL, `purchase_cycle_seq` int(10) unsigned NOT NULL, `visit_in_cycle_seq` int(10) unsigned NOT NULL, `purchased` smallint(5) unsigned NOT NULL COMMENT 'l_ij', UNIQUE KEY `idvisitor_seq` (`site_id`,`idvisitor`,`visit_seq`), KEY `idvisitor_idx` (`site_id`,`idvisitor`) ) ENGINE=InnoDB CREATE TABLE `tbl_log_visit` ( `visit_id` int(10) unsigned NOT NULL AUTO_INCREMENT, `idsite` int(10) unsigned NOT NULL, `idvisitor` binary(8) NOT NULL, `visit_last_action_time` DATETIME, `config_id` int(10) unsigned NOT NULL, PRIMARY KEY (`visit_id`), KEY `index_idsite_config_datetime` (`site_id`,`config_id`,`visit_last_action_time`), KEY `index_idsite_datetime` (`site_id`,`visit_last_action_time`), KEY `index_idsite_idvisitor` (`site_id`,`visitor_id`) ) ENGINE=InnoDB CREATE TABLE `tbl_log_conversion` ( `visit_id` int(10) unsigned NOT NULL, `site_id` int(10) unsigned NOT NULL, `visitor_id` binary(8) NOT NULL, `idgoal` int(10) NOT NULL, `idorder` int(10) NOT NULL, PRIMARY KEY (`visit_id`,`idgoal`), UNIQUE KEY `unique_idsite_idorder` (`site_id`,`idorder`) ) ENGINE=InnoDB Is there some way I can go about improving the performance of this query?
purchasedis used by the application layer to determine the next value for the purchase counter (which is maintained at that level). Thepurchasedcol value returned in the first derived query will only be a 1 or 0. It's simply a flag which is used to update the purchase counter retrieved from the second derived query.