Optimize query with condition on multiple tables

Question

I have two postgres table

Table A

id	owner_id
1	100
2	101

Table B

id	a_id	user_id
1	1	200
2	1	201
3	2	202
4	2	201

id on both tables are PK and integer

I have B-Tree index on a.owner_id , b.a_id , b.user_id

First Query

In the following query

SELECT b.id FROM b JOIN a ON b.a_id = a.id WHERE b.user_id = 201 OR a.owner_id = 100 LIMIT 50;

I have WHERE b.user_id = 201 OR a.owner_id = 100 condition , index for b.user_id is used by query plan but index for a.owner_id is not used, Here is query plan

QUERY PLAN Limit (cost=19.54..4445.84 rows=50 width=4) (actual time=0.125..5.031 rows=50 loops=1) Buffers: shared hit=1054 -> Merge Join (cost=19.54..9815083.22 rows=110872 width=4) (actual time=0.123..5.018 rows=50 loops=1) Merge Cond: (a.id = b.a_id) Join Filter: ((b.user_id = 201) OR (a.owner_id = 100)) Rows Removed by Join Filter: 5547 Buffers: shared hit=1054 -> Index Scan using a_pkey on a (cost=0.42..103568.63 rows=100009 width=20) (actual time=0.011..0.037 rows=50 loops=1) Buffers: shared hit=10 -> Index Scan using b_a_id on b (cost=0.43..9515274.99 rows=11200116 width=24) (actual time=0.009..3.136 rows=5597 loops=1) Buffers: shared hit=1044 Planning Time: 0.626 ms Execution Time: 5.082 ms

The query is a little slow, How can I make it faster?

Second Query

Also have another slower query

SELECT b.id FROM b JOIN a ON b.a_id = a.id WHERE (b.user_id = 201 AND a.owner_id = 100) OR (b.user_id = 100 AND a.owner_id = 201) LIMIT 50;

QUERY PLAN Limit (cost=1000.43..19742.38 rows=50 width=4) (actual time=0.705..63.142 rows=50 loops=1) Buffers: shared hit=1419 read=3994 -> Gather (cost=1000.43..75593.36 rows=199 width=4) (actual time=0.704..63.124 rows=50 loops=1) Workers Planned: 2 Workers Launched: 2 Buffers: shared hit=1419 read=3994 -> Nested Loop (cost=0.43..74573.46 rows=83 width=4) (actual time=0.752..13.122 rows=17 loops=3) Buffers: shared hit=1419 read=3994 -> Parallel Seq Scan on a (cost=0.00..25628.06 rows=83 width=20) (actual time=0.669..11.868 rows=17 loops=3) Filter: ((owner_id = 100) OR (owner_id = 201)) Rows Removed by Filter: 16985 Buffers: shared hit=258 read=3994 -> Index Scan using b_a_id on b (cost=0.43..589.69 rows=1 width=24) (actual time=0.023..0.070 rows=1 loops=52) Index Cond: (a_id = a.id) Filter: (((user_id = 201) OR (user_id = 100)) AND (((user_id = 201) AND (a.owner_id = 100)) OR ((a.owner_id = 201) AND (user_id = 100)))) Rows Removed by Filter: 105 Buffers: shared hit=1161 Planning Time: 0.638 ms Execution Time: 63.202 ms

The plan looks quite reasonable, but for a more detailed analysis, we would need to see the plan generated using explain (analyze, buffers, format text) - not just a "simple" explain — user330315
– user330315, Commented Feb 20, 2021 at 7:27
5 milliseconds doesn't qualify as "a little slow to me". How fast do you need that to be? — user330315
– user330315, Commented Feb 20, 2021 at 8:03
Try to change OR condition to UNION maybe it will be faster — Rostyslav
– Rostyslav, Commented Feb 20, 2021 at 8:07
@a_horse_with_no_name yes, 5 milliseconds is not slow, But when I want to fetch next pages with OFFSET it is very slow, it takes >10 seconds — A.A
– A.A, Commented Feb 20, 2021 at 8:27

bobflux · Accepted Answer · 2021-02-20 10:05:03Z

Create test data...

CREATE UNLOGGED TABLE a AS SELECT a_id, (random()*100000)::INTEGER owner_id FROM generate_series(1,1000000) a_id; CREATE UNLOGGED TABLE b AS SELECT b_id, (random()*100000)::INTEGER a_id, (random()*100000)::INTEGER user_id FROM generate_series(1,10000000) b_id; CREATE INDEX a_o ON a(owner_id); CREATE INDEX b_a ON b(a_id); CREATE INDEX b_u ON b(user_id); ALTER TABLE a ADD PRIMARY KEY(a_id); ALTER TABLE b ADD PRIMARY KEY(b_id); VACUUM ANALYZE a,b;

Problem with first query is postgres doesn't know how to optimize star join, so we have to give it a little help.

WITH ids AS ( SELECT a_id FROM b WHERE user_id=201 UNION SELECT a_id FROM a WHERE owner_id=100 ) SELECT * FROM ids JOIN b USING (a_id) LIMIT 50;

This gives a plan using both indices, it may be faster in your case, or maybe not.

 Limit (cost=455.41..634.97 rows=50 width=12) (actual time=0.494..0.642 rows=50 loops=1) -> Nested Loop (cost=455.41..41596.19 rows=11456 width=12) (actual time=0.492..0.629 rows=50 loops=1) -> HashAggregate (cost=450.19..451.32 rows=113 width=4) (actual time=0.425..0.427 rows=1 loops=1) Group Key: b_1.a_id Batches: 1 Memory Usage: 24kB -> Append (cost=5.23..449.91 rows=113 width=4) (actual time=0.076..0.358 rows=98 loops=1) -> Bitmap Heap Scan on b b_1 (cost=5.23..401.21 rows=102 width=4) (actual time=0.075..0.299 rows=92 loops=1) Recheck Cond: (user_id = 201) Heap Blocks: exact=92 -> Bitmap Index Scan on b_u (cost=0.00..5.20 rows=102 width=0) (actual time=0.035..0.035 rows=92 loops=1) Index Cond: (user_id = 201) -> Bitmap Heap Scan on a (cost=4.51..47.00 rows=11 width=4) (actual time=0.019..0.033 rows=6 loops=1) Recheck Cond: (owner_id = 100) Heap Blocks: exact=6 -> Bitmap Index Scan on a_o (cost=0.00..4.51 rows=11 width=0) (actual time=0.014..0.014 rows=6 loops=1) Index Cond: (owner_id = 100) -> Bitmap Heap Scan on b (cost=5.22..363.09 rows=101 width=12) (actual time=0.059..0.174 rows=50 loops=1) Recheck Cond: (a_id = b_1.a_id) Heap Blocks: exact=50 -> Bitmap Index Scan on b_a (cost=0.00..5.19 rows=101 width=0) (actual time=0.023..0.023 rows=104 loops=1) Index Cond: (a_id = b_1.a_id) Planning Time: 0.448 ms Execution Time: 0.747 ms

As for the other query, I had to run this:

select owner_id, user_id, count(*) from a join b using (a_id) group by owner_id,user_id order by count(*) desc limit 100;

to get some user_id,owner_id that would actually return results from my test data. Then,

EXPLAIN ANALYZE SELECT b.* FROM b JOIN a USING (a_id) WHERE (b.user_id = 99238 AND a.owner_id = 58599) OR (b.user_id = 36859 AND a.owner_id = 99027) LIMIT 50; Limit (cost=24.97..532.32 rows=1 width=12) (actual time=0.274..0.982 rows=6 loops=1) -> Nested Loop (cost=24.97..532.32 rows=1 width=12) (actual time=0.271..0.976 rows=6 loops=1) -> Bitmap Heap Scan on a (cost=9.03..92.70 rows=22 width=8) (actual time=0.108..0.216 rows=12 loops=1) Recheck Cond: ((owner_id = 58599) OR (owner_id = 99027)) Heap Blocks: exact=12 -> BitmapOr (cost=9.03..9.03 rows=22 width=0) (actual time=0.086..0.088 rows=0 loops=1) -> Bitmap Index Scan on a_o (cost=0.00..4.51 rows=11 width=0) (actual time=0.064..0.065 rows=3 loops=1) Index Cond: (owner_id = 58599) -> Bitmap Index Scan on a_o (cost=0.00..4.51 rows=11 width=0) (actual time=0.020..0.020 rows=9 loops=1) Index Cond: (owner_id = 99027) -> Bitmap Heap Scan on b (cost=15.95..19.97 rows=1 width=12) (actual time=0.058..0.060 rows=0 loops=12) Recheck Cond: ((a_id = a.a_id) AND ((user_id = 99238) OR (user_id = 36859))) Filter: (((user_id = 99238) AND (a.owner_id = 58599)) OR ((user_id = 36859) AND (a.owner_id = 99027))) Heap Blocks: exact=6 -> BitmapAnd (cost=15.95..15.95 rows=1 width=0) (actual time=0.053..0.053 rows=0 loops=12) -> Bitmap Index Scan on b_a (cost=0.00..5.19 rows=101 width=0) (actual time=0.015..0.015 rows=50 loops=12) Index Cond: (a_id = a.a_id) -> BitmapOr (cost=10.50..10.50 rows=205 width=0) (actual time=0.046..0.046 rows=0 loops=6) -> Bitmap Index Scan on b_u (cost=0.00..5.20 rows=102 width=0) (actual time=0.021..0.021 rows=121 loops=6) Index Cond: (user_id = 99238) -> Bitmap Index Scan on b_u (cost=0.00..5.20 rows=102 width=0) (actual time=0.024..0.024 rows=105 loops=6) Index Cond: (user_id = 36859) Planning Time: 0.703 ms Execution Time: 1.063 ms

It doesn't use a seq scan as yours do, so maybe you have an old version that couldn't optimize this properly? It's quite weird that it picks a seq scan for table a when the row count estimates are pretty accurate. You should investigate, maybe try

SELECT * FROM a WHERE a.owner_id = 58599 OR a.owner_id = 99027 LIMIT 50;

this should give an index or bitmap index scan, if it does a seq scan, then you have a small test case to find out why. Anyway, you can still force use of indices with:

EXPLAIN ANALYZE WITH ids AS ( SELECT a_id FROM b WHERE user_id IN (99238,36859) UNION SELECT a_id FROM a WHERE owner_id IN (58599,99027) ) SELECT * FROM ids JOIN b USING (a_id) JOIN a USING (a_id) WHERE (b.user_id = 99238 AND a.owner_id = 58599) OR (b.user_id = 36859 AND a.owner_id = 99027);

...but it's pretty damn ugly. Or you could do each clause in your OR separately and do the AND manyally with this one, which is also ugly:

EXPLAIN ANALYZE SELECT a_id FROM b WHERE b.user_id = 99238 INTERSECT SELECT a_id FROM a WHERE a.owner_id = 58599 LIMIT 50;

How can I optimize large OFFSETs

You don't, in fact when large offsets are used it usually hints that you're doing it wrong,by repeatedly doing the same query, for example for pagination, and displaying chunks of results. There are two solutions. If results will be fetched quickly enough that a transaction can stay open while you do that, open a cursor for the query without LIMIT or OFFSET and use FETCH to get results in chunks. Otherwise, do the query once without LIMIT, store the results in a cache, and paginate from that without redoing the query.

Thanks for your awesome solution, Honestly I cannot understand how it works after reading it twice, I must play with it more
It's a join with conditions on both tables, and an effective way to do it is to get the ids of the rows from both tables that satisfy the condition in each table, and then union or intersect those two sets, which gives the rows that satisfy either one or both conditions.

Laurenz Albe · Accepted Answer · 2021-02-20 09:36:16Z

1

Use UNION rather than OR:

SELECT * FROM ((SELECT b.id FROM b JOIN a ON b.a_id = a.id WHERE b.user_id = 201 LIMIT 50) UNION (SELECT b.id FROM b JOIN a ON b.a_id = a.id WHERE a.owner_id = 100 LIMIT 50)) AS q LIMIT 50;

Indexes on a(owner_id), a(id), b(user_id) and b(a_id) will make it fast.

answered Feb 20, 2021 at 9:36

Laurenz Albe

257k22 gold badges312 silver badges388 bronze badges

4 Comments

A.A Over a year ago

You have 3 LIMIT 50, How can I get next 50s result?

Laurenz Albe Over a year ago

Each branch of the UNION fetches at most 50 rows, and the final LIMIT makes sure that no more than 50 are returned.

A.A Over a year ago

If we set before OFSSET 50 final LIMIT, as branch of the UNION fetches at most 50 rows, iteration over all rows it not possible and we skip some rows

Laurenz Albe Over a year ago

I don't see how OFFSET could come into play here.

Collectives™ on Stack Overflow

Optimize query with condition on multiple tables

Table A

Table B

First Query

Second Query

2 Answers 2

2 Comments

4 Comments

Hot Network Questions

Collectives™ on Stack Overflow

Table A

Table B

First Query

Second Query

2 Answers 2

2 Comments

4 Comments

Related