PostgreSQL predicate not pushed down (through join conditions)

Question

Consider the following data model in a PostgreSQL v13 system;

Here, parent table dim contains a small set of reference data, and child table fact contains a much higher volume of records. A typical use case for these data sets would be to query all fact::value's data belonging to a dim::name. Note that dim::name holds a UNIQUE constraint.

While I think this is a very common scenario, I was somewhat taken aback that the style of queries I've been using for years on other RDBMS's (Oracle, MSSQL) didn't perform at all on PostgreSQL the way I imagined they would. That is, when querying a dataset (fact) using a highly selective, but implicit, predicate (fact::dim_id eq X) through a join condition, I expect the index on fact::dim_id to be used (in a nested-loop). Instead, a hash-join is used, requiring a full table scan of fact.

Question: is there some way I can nudge the query planner into considering any predicate I issue on a joined relation to not need a full table scan? (without impacting other DB loads)

To illustrate the problem with an example, these tables are populated with some random data;

CREATE TABLE dim( id SERIAL NOT NULL , name TEXT NOT NULL , CONSTRAINT pk_dim PRIMARY KEY (id) , CONSTRAINT uq_dim UNIQUE (name) ); CREATE TABLE fact( id SERIAL NOT NULL , dim_id INTEGER NOT NULL , value TEXT , CONSTRAINT pk_fact PRIMARY KEY (id) , CONSTRAINT fk_facts_dim FOREIGN KEY (dim_id) REFERENCES dim (id) ); CREATE INDEX idx_fact_dim ON fact(dim_id); INSERT INTO dim(name) SELECT SUBSTRING(md5(random()::TEXT) FOR 5) FROM generate_series(1,50) UNION SELECT 'key'; INSERT INTO fact(dim_id, value) SELECT (SELECT id FROM dim ORDER BY random() LIMIT 1) , md5(random()::TEXT) FROM generate_series(1,1000000); ANALYZE dim; ANALYZE fact;

EXPLAIN ANALYZE SELECT f.* FROM fact AS f JOIN dim AS d ON (d.id = f.dim_id) WHERE d.name = 'key'; -- Note: UNIQUE QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------- Gather (cost=1001.65..18493.29 rows=20588 width=41) (actual time=319.331..322.582 rows=0 loops=1) Workers Planned: 2 Workers Launched: 2 -> Hash Join (cost=1.65..15434.49 rows=8578 width=41) (actual time=306.193..306.195 rows=0 loops=3) Hash Cond: (f.dim_id = d.id) -> Parallel Seq Scan on fact f (cost=0.00..14188.98 rows=437498 width=41) (actual time=0.144..131.050 rows=350000 loops=3) -> Hash (cost=1.64..1.64 rows=1 width=4) (actual time=0.138..0.139 rows=1 loops=3) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Seq Scan on dim d (cost=0.00..1.64 rows=1 width=4) (actual time=0.099..0.109 rows=1 loops=3) Filter: (name = 'key'::text) Rows Removed by Filter: 50 Planning Time: 1.059 ms Execution Time: 322.662 ms

Now, we execute the same question, but instead of filtering using an inner join, we filter using a scalar subquery;

EXPLAIN ANALYZE SELECT * FROM fact WHERE dim_id = (SELECT id FROM dim WHERE name = 'key'); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Index Scan using idx_fact_dim on fact (cost=2.07..15759.53 rows=524998 width=41) (actual time=0.096..0.097 rows=0 loops=1) Index Cond: (dim_id = $0) InitPlan 1 (returns $0) -> Seq Scan on dim (cost=0.00..1.64 rows=1 width=4) (actual time=0.046..0.054 rows=1 loops=1) Filter: (name = 'key'::text) Rows Removed by Filter: 50 Planning Time: 0.313 ms Execution Time: 0.156 ms

As shown, the performance difference is huge. Somehow, the query planner did not consider the predicate on the unique dim::name attribute to be equal to a predicate on fact::dim_id in the first query.

paperskilltrees · Accepted Answer · 2024-10-14 16:23:46Z

I don't know how to nudge the planner to do the right thing in your example, and I agree that the filter condition should be pushed inside join. However, there is a mistake in your database population code, and on realistic data the planner does a decent job and indeed pushes the condition through join (Postgres 16.3). This is too long for a comment, hence posting as a partial answer.

The subquery in populating fact table is decoupled from the outer query and executed only once:

INSERT INTO fact(dim_id, value) SELECT (SELECT id FROM dim ORDER BY random() LIMIT 1) , md5(random()::TEXT) FROM generate_series(1,1000000);

This issue was definitely present in your analysis, because your query plans returned 0 rows.

To make the code do what we want it to do, I added a spurious coupling (if you know a better way, please share):

INSERT INTO fact(dim_id, value) SELECT (SELECT id FROM dim ORDER BY random() + 0*i LIMIT 1) , md5(random()::TEXT) FROM generate_series(1,1000000) i;

Now the "slow" and the "fast" queries are roughly equivalent:

SELECT f.* FROM fact AS f JOIN dim AS d ON (d.id = f.dim_id) WHERE d.name = 'key'; -- Note: UNIQUE

 QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Nested Loop (cost=174.49..9963.30 rows=19608 width=41) (actual time=2.498..21.194 rows=19661 loops=1) -> Seq Scan on dim d (cost=0.00..1.64 rows=1 width=4) (actual time=0.012..0.018 rows=1 loops=1) Filter: (name = 'key'::text) Rows Removed by Filter: 50 -> Bitmap Heap Scan on fact f (cost=174.49..9765.59 rows=19608 width=41) (actual time=2.481..17.209 rows=19661 loops=1) Recheck Cond: (dim_id = d.id) Heap Blocks: exact=8252 -> Bitmap Index Scan on idx_fact_dim (cost=0.00..169.59 rows=19608 width=0) (actual time=1.262..1.262 rows=19661 loops=1) Index Cond: (dim_id = d.id) Planning Time: 0.843 ms Execution Time: 21.998 ms (11 rows)

compare to:

SELECT * FROM fact WHERE dim_id = (SELECT id FROM dim WHERE name = 'key');

 QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on fact (cost=176.12..9767.22 rows=19608 width=41) (actual time=2.782..15.379 rows=19661 loops=1) Recheck Cond: (dim_id = $0) Heap Blocks: exact=8252 InitPlan 1 (returns $0) -> Seq Scan on dim (cost=0.00..1.64 rows=1 width=4) (actual time=0.008..0.012 rows=1 loops=1) Filter: (name = 'key'::text) Rows Removed by Filter: 50 -> Bitmap Index Scan on idx_fact_dim (cost=0.00..169.59 rows=19608 width=0) (actual time=1.370..1.371 rows=19661 loops=1) Index Cond: (dim_id = $0) Planning Time: 0.255 ms Execution Time: 16.174 ms (11 rows)```

Stack Exchange Network

PostgreSQL predicate not pushed down (through join conditions)

1 Answer 1

Hot Network Questions

PostgreSQL predicate not pushed down (through join conditions)

1 Answer 1

Related

Hot Network Questions