I have a table market_trades with a timestamp column:
```
api_production=# \d market_trades;
                                  Table "public.market_trades"
   Column   |            Type             |                          Modifiers
------------+-----------------------------+-------------------------------------------------------------
 id         | integer                     | not null default nextval('market_trades_id_seq'::regclass)
 market_id  | integer                     | not null
 timestamp  | timestamp without time zone | not null
 amount     | numeric(16,8)               | not null
 price      | numeric(16,8)               | not null
 created_at | timestamp without time zone |
 local_id   | integer                     | not null
Indexes:
    "market_trades_pkey" PRIMARY KEY, btree (id)
    "market_trades__uniqueness" UNIQUE, btree (market_id, "timestamp", local_id)
```

The table has about 30 million rows.
In the simplest case, I need to take a time range and a sub-interval length, break the time range into sub-intervals of the given length, and count the number of trades whose timestamp falls in each sub-interval.
For example, if the time range is [2013-01-01 00:00, 2013-02-01 00:00) and the sub-interval length is 1 day, then my result set will have 31 rows (one for each day), each listing the number of trades with a timestamp in that day.
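(If the sub-intervals were always whole days, a plain date_trunc grouping roughly like the sketch below would give the same counts. But date_trunc only truncates to fixed units like hour or day, so it can't handle arbitrary lengths such as the 17-hour buckets used in the query further down, and it also skips empty bins.)

```sql
-- Sketch of the fixed 1-day case only.
-- Days with zero trades produce no row here, unlike the interval-based query below.
SELECT date_trunc('day', "timestamp") AS day,
       count(*) AS agg
FROM market_trades
WHERE "timestamp" >= '2013-01-01 00:00'
  AND "timestamp" <  '2013-02-01 00:00'
GROUP BY 1
ORDER BY 1;
```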
I already designed a query to do this:
```sql
WITH vals AS (
  SELECT '2013-08-19 0:00'::timestamp AS frame_start,
         '2013-08-26 0:00'::timestamp AS frame_end,
         '17h'::interval              AS interval_length
), intervals AS (
  SELECT tsrange(start_time, lead(start_time, 1, frame_end) OVER (ORDER BY start_time NULLS FIRST)) AS time_range
  FROM (
    SELECT generate_series(frame_start, frame_end, interval_length) AS start_time,
           frame_end
    FROM vals
  ) _
  WHERE start_time < frame_end
)
SELECT time_range, count(td.id) AS agg
FROM intervals i
LEFT JOIN market_trades td ON td.timestamp <@ i.time_range
GROUP BY time_range
ORDER BY time_range;
```

which outputs
```
                  time_range                   |  agg
-----------------------------------------------+-------
 ["2013-08-19 00:00:00","2013-08-19 17:00:00") | 10852
 ["2013-08-19 17:00:00","2013-08-20 10:00:00") |  4727
 ["2013-08-20 10:00:00","2013-08-21 03:00:00") |  7875
 ["2013-08-21 03:00:00","2013-08-21 20:00:00") | 12137
 ["2013-08-21 20:00:00","2013-08-22 13:00:00") |  8454
 ["2013-08-22 13:00:00","2013-08-23 06:00:00") | 10284
 ["2013-08-23 06:00:00","2013-08-23 23:00:00") |  8385
 ["2013-08-23 23:00:00","2013-08-24 16:00:00") |  4978
 ["2013-08-24 16:00:00","2013-08-25 09:00:00") |  4709
 ["2013-08-25 09:00:00","2013-08-26 00:00:00") |  8620
(10 rows)

Time: 64433.256 ms
```

The problem is that it is VERY slow: as you can see, that query took over a minute to run. So, naturally, I tried adding a btree index on the timestamp column (statement shown below), but it didn't help at all. I've also found that the runtime is proportional to the number of sub-intervals: doubling both the overall time range and the sub-interval length (which keeps the number of sub-intervals the same) takes about the same amount of time as the original.
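For reference, the index I tried was a plain btree on the timestamp column, something like:

```sql
-- Plain btree index on "timestamp" (the index name here is just illustrative).
CREATE INDEX index_market_trades_on_timestamp ON market_trades ("timestamp");
```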
Here is the output of EXPLAIN ANALYZE on the above query:
```
                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=198931037.42..198931037.92 rows=200 width=36) (actual time=460481.744..460481.750 rows=10 loops=1)
   Sort Key: i.time_range
   Sort Method: quicksort  Memory: 25kB
   CTE vals
     ->  Result  (cost=0.00..0.01 rows=1 width=0) (actual time=0.002..0.003 rows=1 loops=1)
   CTE intervals
     ->  WindowAgg  (cost=31.47..38.13 rows=333 width=16) (actual time=0.057..0.174 rows=10 loops=1)
           ->  Sort  (cost=31.47..32.30 rows=333 width=16) (actual time=0.047..0.063 rows=10 loops=1)
                 Sort Key: _.start_time
                 Sort Method: quicksort  Memory: 25kB
                 ->  Subquery Scan on _  (cost=0.00..17.52 rows=333 width=16) (actual time=0.012..0.034 rows=10 loops=1)
                       Filter: (_.start_time < _.frame_end)
                       ->  CTE Scan on vals  (cost=0.00..5.02 rows=1000 width=32) (actual time=0.009..0.019 rows=10 loops=1)
   ->  HashAggregate  (cost=198930989.64..198930991.64 rows=200 width=36) (actual time=460481.713..460481.721 rows=10 loops=1)
         ->  Nested Loop Left Join  (cost=0.00..198881140.83 rows=9969761 width=36) (actual time=25612.620..460406.061 rows=81021 loops=1)
               Join Filter: (td."timestamp" <@ i.time_range)
               Rows Removed by Join Filter: 299306019
               ->  CTE Scan on intervals i  (cost=0.00..6.66 rows=333 width=32) (actual time=0.060..0.221 rows=10 loops=1)
               ->  Materialize  (cost=0.00..875147.34 rows=29939223 width=12) (actual time=0.006..24275.128 rows=29938704 loops=10)
                     ->  Seq Scan on market_trades td  (cost=0.00..579263.23 rows=29939223 width=12) (actual time=0.008..22103.565 rows=29938704 loops=1)
 Total runtime: 460617.278 ms
(21 rows)
```

Any suggestions?
EDIT: The problem is clearly that it's checking td.timestamp <@ i.time_range for every pair of a trade and an interval. There must be a better way to filter records within a timestamp range.
EDIT 2: If I change the line

```sql
ON td.timestamp <@ i.time_range
```

to

```sql
ON td.timestamp >= lower(i.time_range) AND td.timestamp < upper(i.time_range)
```

then it executes MUCH faster, so long as timestamp is indexed.
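To be explicit, the full query with only that one change applied looks like this (everything else is identical to the query above):

```sql
WITH vals AS (
  SELECT '2013-08-19 0:00'::timestamp AS frame_start,
         '2013-08-26 0:00'::timestamp AS frame_end,
         '17h'::interval              AS interval_length
), intervals AS (
  SELECT tsrange(start_time, lead(start_time, 1, frame_end) OVER (ORDER BY start_time NULLS FIRST)) AS time_range
  FROM (
    SELECT generate_series(frame_start, frame_end, interval_length) AS start_time,
           frame_end
    FROM vals
  ) _
  WHERE start_time < frame_end
)
SELECT time_range, count(td.id) AS agg
FROM intervals i
LEFT JOIN market_trades td
       ON td.timestamp >= lower(i.time_range)
      AND td.timestamp <  upper(i.time_range)
GROUP BY time_range
ORDER BY time_range;
```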
But this is still not ideal, because the nice thing about time ranges is that they properly handle NULL bounds (no lower bound or no upper bound), and checking for those NULL conditions manually in the ON clause slows down the query even more than the original.
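Concretely, by "checking for those NULL conditions manually" I mean replacing the join condition with something along these lines (only the join fragment is shown; the rest of the query is unchanged), which is what performs even worse:

```sql
-- Fragment only: NULL-tolerant version of the join condition,
-- so that ranges with no lower and/or no upper bound still match.
LEFT JOIN market_trades td
       ON (lower(i.time_range) IS NULL OR td.timestamp >= lower(i.time_range))
      AND (upper(i.time_range) IS NULL OR td.timestamp <  upper(i.time_range))
```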
Is there an indexing strategy (or better query structure) that makes working with time ranges as fast as start/end points?