0
-- The CTE1 is just to create columns that dictate the bound of what is considered the same entry -- Also I do a dense rank by ACT_TIME, and a PARITION BY on ID1, ID2 -- so all ID1/ID2 combos are ranked by when they ran WITH cte1 AS (SELECT *, (X-2) AS X_START, (X+2) AS X_END, (Y-2) AS Y_START, (Y+2) AS Y_END, (Z*1.2) AS Z_MAX, DENSE_RANK() OVER (PARTITION BY ID1, ID2 ORDER BY ACT_TIME) AS DENSE_RANK FROM data ORDER BY data.ACT_TIME) ,cte2 AS ( -- Create new set of column as comparisons SELECT ID AS ID_COMP, ID1 AS ID1_COMP, ID2 AS ID2_COMP, X_START AS X_START_COMP, X_END AS X_END_COMP, Y_START AS Y_START_COMP, Y_END AS Y_END_COMP, Z AS Z_MAX_COMP, DENSE_RANK AS DENSE_RANK_COMP FROM cte1) , cte3 AS ( -- join cte1 on cte2 only when X and Y value from cte1 is between the limits of cte2 AND -- Z max value from cte 2 is larger than Z value from cte1 AND ID1/ID2 match -- The result will have an ID of a row that should be removed since their x and y was in between the limits -- Then remove any rows where rank from cte2 is higher than cte1 -- Remove any rows that were joined onto it self SELECT cte1.* , cte2.* FROM cte1 JOIN cte2 ON (( cte2.X_END_COMP >= cte1.X AND cte1.X >= cte2.X_START_COMP) AND (cte2.Y_END_COMP >= cte1.Y AND cte1.Y>= cte2.Y_START_COMP) AND (cte1.Z < cte2.Z_MAX_COMP) AND (cte2.ID2_COMP = cte1.ID2) AND (cte2.ID1_COMP = cte1.ID1)) WHERE cte1.ID <> cte2.ID_COMP AND cte2.DENSE_RANK_COMP <= cte1.DENSE_RANK) -- Any IDs that shows up in cte3 remove from the final result SELECT data.* FROM data WHERE ID NOT IN (SELECT DISTINCT ID FROM cte3) ORDER BY data.ACT_TIME 

Here's my create table

CREATE TABLE `data` ( `ID` INT(11) NOT NULL AUTO_INCREMENT, `ID1` VARCHAR(10) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci', `ID2` INT(11) NULL DEFAULT NULL, `ACT_TIME` TIMESTAMP NULL DEFAULT NULL, `X` FLOAT(12) NULL DEFAULT NULL, `Y` FLOAT(12) NULL DEFAULT NULL, `Z` FLOAT(12) NULL DEFAULT NULL, PRIMARY KEY (`ID`) USING BTREE, INDEX `ID1` (`ID1`) USING BTREE, INDEX `ACT_TIME` (`ACT_TIME`) USING BTREE ); 

Here's the EXPLAIN {query} result enter image description here

Here's how the query should work.

I want to remove any rows (that have the same ID1 and ID2) that occur later on where X and Y are in between +-2 and a Z less than 1.2*z

Here's an example input

enter image description here

Example output

enter image description here

This query takes about 5min with 2.5M rows.

I am on MariDB 10.5.5

Any and all help is appreciated!

EDIT for Rick James here's your explain {query} here is the explain result enter image description here

5
  • Could you add the EXPLAIN <query>; output too? Commented Oct 6, 2021 at 16:35
  • Any idea what the query should do? Commented Oct 6, 2021 at 17:33
  • @jkavalik Added the result of the EXPLAIN Commented Oct 6, 2021 at 19:11
  • @GerardH.Pille gave an example table and what the output should be Commented Oct 6, 2021 at 19:11
  • Pictures? Great! Perhaps it's possible to extract some test data using OCR. Commented Oct 7, 2021 at 9:00

2 Answers 2

2

Here is an optimized, bug-fixed query. Fiddle for demonstration.

WITH cte1 AS ( SELECT *, (X-2) AS X_START, (X+2) AS X_END, (Y-2) AS Y_START, (Y+2) AS Y_END, (Z*1.2) AS Z_MAX, DENSE_RANK() OVER (PARTITION BY ID1, ID2 ORDER BY ACT_TIME) AS `DENSE_RANK` FROM data ), cte3 AS ( SELECT c.ID FROM cte1 JOIN cte1 c ON (cte1.ID <> c.ID and cte1.X_END >= c.X and c.X >= cte1.X_START and cte1.Y_END >= c.Y and c.Y >= cte1.Y_START and c.Z < cte1.Z*1.2 and c.`DENSE_RANK` > cte1.`DENSE_RANK`) ) SELECT data.* FROM data WHERE NOT EXISTS (SELECT 1 FROM cte3 where cte3.ID=data.ID) ORDER BY data.ACT_TIME 

Issues I addressed:

  • Changed WHERE NOT IN to WHERE NOT EXISTS, IN with a large list is a common performance issue.
  • The optimizer noticed that cte2 was entirely unnecessary, being simple column aliases, so I eliminated it and made cte3 a strict self-join. c is the candidate row.
  • The Z_MAX bound was missing the 1.2 coefficient. Your query returned 2 extra rows from your example output due to this being missing.
  • Removed the ORDER BY in cte1 as it was unnecessary.

The goal here was to get rid of as many sorts as possible and keep down the number of table scans. If your server has especially poor I/O performance it may be worth exploring memory tuning options to keep the sorts from spilling to disk.

PS: Rick James noticed that ACT_TIME could be used as a tiebreaker instead of DENSE_RANK() and if that works for your data then that would get rid of another sort.

2
  • Thank you, with your improvements (haven't done the ACT_TIME part yet) I was able to go from 280s -> 215s Commented Oct 7, 2021 at 16:37
  • For some reason when I remove the DENSE_RANK part the query doesn't seem to finish Commented Oct 7, 2021 at 17:32
0

I don't think you need CTEs. See if something like this will work:

SELECT d1.*, d2.* FROM data d1 JOIN data d2 ON d2.x BETWEEN d1.x - 2 AND d1.x + 2 AND d2.y BETWEEN d1.y - 2 AND d1.y + 2 AND d2.z ... (I don't understand the criteria here) AND d2.act_time < d1.act_time 

Indexes for the optimizer to pick among:

INDEX(x,y,z,act_time) INDEX(y,z,x,act_time) INDEX(z,x,y,act_time) 

Meanwhile, Remove the (12) from FLOAT(12).

I don't see the explanation for whey DENSE_RANK is needed.

5
  • hmm I am not sure why, but when I try do this filter it just never finishes Commented Oct 7, 2021 at 15:56
  • @mike_gundy123 - OK. Could you provide EXPLAIN SELECT for my version (and any future versions you review). Also, what did you do about the d2.z test? Oh, I seem to have lost an operator for act_time;fixing now. Commented Oct 7, 2021 at 16:21
  • ya I did AND d2.Z < d1.Z * 1.2 Commented Oct 7, 2021 at 16:35
  • Added explain result in the post Commented Oct 7, 2021 at 16:55
  • @mike_gundy123 - Thanks for the Explain. As I expected, tt shows a full scan (for d1) and a partial scan of d2. However, It does not seem to have the 3 indexes I suggested. Commented Oct 7, 2021 at 16:58

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.