
I am performing the following query with a self join:

with t as (
    SELECT *, TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' as tstamp2
    FROM mytable
    WHERE id = 'a'
    LIMIT 1000
)
select v1.id as id,
       date_trunc('hour', v1.tstamp2) as hour,
       v1.value as start,
       v2.value as stop
from t v1
join t v2
  on  v1.id = v2.id
  and date_trunc('hour', v1.tstamp2) = date_trunc('hour', v2.tstamp2)
  and v1.tstamp2 < v2.tstamp2
where 1=1
limit 100;

The table looks like this:

id | tstamp | value | tstamp2

My goal is to output all the combinations of "value" within the same hour for each id. I have 100,000 unique ids and millions of rows. This is extremely slow and inefficient. Is there a way to break up the query so the self join operates on time partitions (hour by hour, for example) to improve its speed?

EDIT: I found the following, which seems to be what I want to do, but I have no idea how to implement it:

If you know more than you've let on about the properties of the intervals, you might be able to improve things. For instance if the intervals fall into nonoverlapping buckets then you could add a constraint that the buckets of the two sides are equal. Postgres is a lot better with equality join constraints than it is with range constraints, so it would be able to match up rows and only do the O(N^2) work within each bucket.
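In concrete terms, a minimal sketch of that advice could look as follows; hour_bucket is a hypothetical derived column (the query above already compares DATE_TRUNC values, the point is to compute the bucket once so the equality can be exploited by an index or sort key):

-- Sketch only: "bucketed" and "hour_bucket" are made-up names.
WITH bucketed AS (
    SELECT id,
           value,
           TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second'                     AS tstamp2,
           DATE_TRUNC('hour', TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second') AS hour_bucket
    FROM mytable
    WHERE id = 'a'
)
SELECT v1.id,
       v1.hour_bucket AS hour,
       v1.value       AS start,
       v2.value       AS stop
FROM bucketed v1
JOIN bucketed v2
  ON  v1.id = v2.id
  AND v1.hour_bucket = v2.hour_bucket   -- equality constraint on the bucket
  AND v1.tstamp2 < v2.tstamp2;          -- the O(N^2) pairing happens only within a bucket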

  • First rule of efficiency: remove the CTE and replace it with a temp view or two identical subqueries. BTW: LIMIT 1000 without ORDER BY makes no sense at all. Commented Jun 17, 2018 at 18:43
  • Thanks, temp tables might be the best option; two subqueries might overload memory. Yes, you are right about the LIMIT: it just happened that when previewing my data I could eyeball the results this way without paying the cost of an ORDER BY :) pure luck on my part. Commented Jun 17, 2018 at 18:45
  • 1
    Temp tables are terrible. 2 subqueries might overload memory, Huh? Commented Jun 17, 2018 at 18:46
  • @wildplasser do you think a stored procedure, where I first store all my unique ids and a for loop evaluates the valid pairs of values for each id, would be better? Or are for loops inefficient by nature? Commented Jun 17, 2018 at 18:47
  • 1
    In general set operations are preferable over any procedural approach regarding performance. A loop is more likely to slow things down than the other way round. Commented Jun 17, 2018 at 18:57

1 Answer

This answers the question as originally tagged -- "Postgres", not "Redshift".

Unfortunately, Postgres materializes CTEs, which then precludes the use of indexes. You have no ORDER BY in the CTE, so arbitrary rows are being chosen.
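Side note: Postgres 12 and later (released after this answer was written) let you switch that materialization off explicitly. A minimal sketch, with the questionable LIMIT dropped:

-- Postgres 12+ only: NOT MATERIALIZED lets the planner inline the CTE.
WITH t AS NOT MATERIALIZED (
    SELECT *, TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' AS tstamp2
    FROM mytable
    WHERE id = 'a'
)
SELECT v1.value AS start, v2.value AS stop
FROM t v1
JOIN t v2
  ON  v1.id = v2.id
  AND DATE_TRUNC('hour', v1.tstamp2) = DATE_TRUNC('hour', v2.tstamp2)
  AND v1.tstamp2 < v2.tstamp2;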

One solution is a temporary table and indexes:

CREATE TEMPORARY TABLE t AS
    SELECT t.*,
           TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second' AS tstamp2,
           DATE_TRUNC('hour', TIMESTAMP 'epoch' + tstamp * INTERVAL '1 second') AS tstamp2_hour
    FROM mytable t
    WHERE t.id = 'a'
    LIMIT 1000;

CREATE INDEX t_id_hour_tstamp2 ON t(id, tstamp2_hour, tstamp2);

SELECT v1.id AS id,
       v1.tstamp2_hour AS hour,
       v1.value AS start,
       v2.value AS stop
FROM t v1
JOIN t v2
  ON  v1.id = v2.id
  AND v1.tstamp2_hour = v2.tstamp2_hour
  AND v1.tstamp2 < v2.tstamp2
LIMIT 100;
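One hedged addition, not part of the original answer: after loading and indexing the temp table, ANALYZE gives the planner fresh statistics for it, and EXPLAIN shows whether the new index is actually used:

-- Refresh statistics for the just-created temp table.
ANALYZE t;

-- Inspect the plan; the join should pick up t_id_hour_tstamp2.
EXPLAIN
SELECT v1.id, v1.tstamp2_hour AS hour, v1.value AS start, v2.value AS stop
FROM t v1
JOIN t v2
  ON  v1.id = v2.id
  AND v1.tstamp2_hour = v2.tstamp2_hour
  AND v1.tstamp2 < v2.tstamp2
LIMIT 100;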

4 Comments

A temp VIEW is better, because it allows the planner to use the indexes (if any).
Thanks a lot for your answer. Unfortunately I am using Redshift, therefore there is no real indexing :/
Kill it with fire. Without indexes it is beyond repair.
how is your table sorted and distributed?
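For the Redshift case mentioned in the comments, the closest lever to an index is the table's distribution and sort keys. A hypothetical sketch, assuming a rebuilt copy of the table (the DISTKEY/SORTKEY choice and the name mytable_sorted are assumptions, not the asker's actual DDL):

-- Hypothetical rebuilt copy of the table for Redshift.
CREATE TABLE mytable_sorted
DISTKEY (id)            -- co-locate all rows for the same id on one slice
SORTKEY (id, tstamp)    -- keep each id's rows in time order on disk
AS
SELECT id, tstamp, value
FROM mytable;

With the data distributed on id and sorted by (id, tstamp), the hour-bucket equality join reads contiguous blocks per id and hour instead of scanning the whole table.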
