Consider the following data model in a PostgreSQL v13 system:
Here, the parent table dim contains a small set of reference data, and the child table fact contains a much higher volume of records. A typical use case for these data sets is to fetch all fact::value rows belonging to a given dim::name. Note that dim::name carries a UNIQUE constraint.
While I think this is a very common scenario, I was somewhat taken aback that a style of query I've been using for years on other RDBMSs (Oracle, MSSQL) doesn't perform on PostgreSQL the way I imagined it would. That is, when querying a large table (fact) with a highly selective but implicit predicate (fact::dim_id = X) expressed through a join condition, I expect the index on fact::dim_id to be used in a nested loop. Instead, a hash join is chosen, which requires a full table scan of fact.
Question: is there some way I can nudge the query planner into recognizing that a selective predicate on a joined relation does not require a full table scan, without impacting other DB loads?
To illustrate the problem, the tables are populated with some random data:
```sql
CREATE TABLE dim(
  id   SERIAL NOT NULL
, name TEXT   NOT NULL
, CONSTRAINT pk_dim PRIMARY KEY (id)
, CONSTRAINT uq_dim UNIQUE (name)
);

CREATE TABLE fact(
  id     SERIAL  NOT NULL
, dim_id INTEGER NOT NULL
, value  TEXT
, CONSTRAINT pk_fact PRIMARY KEY (id)
, CONSTRAINT fk_facts_dim FOREIGN KEY (dim_id) REFERENCES dim (id)
);

CREATE INDEX idx_fact_dim ON fact(dim_id);

INSERT INTO dim(name)
SELECT SUBSTRING(md5(random()::TEXT) FOR 5)
FROM generate_series(1,50)
UNION
SELECT 'key';

INSERT INTO fact(dim_id, value)
SELECT (SELECT id FROM dim ORDER BY random() LIMIT 1)
     , md5(random()::TEXT)
FROM generate_series(1,1000000);

ANALYZE dim;
ANALYZE fact;

EXPLAIN ANALYZE
SELECT f.*
FROM fact AS f
JOIN dim AS d ON (d.id = f.dim_id)
WHERE d.name = 'key'; -- Note: UNIQUE
```

```
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1001.65..18493.29 rows=20588 width=41) (actual time=319.331..322.582 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Hash Join  (cost=1.65..15434.49 rows=8578 width=41) (actual time=306.193..306.195 rows=0 loops=3)
         Hash Cond: (f.dim_id = d.id)
         ->  Parallel Seq Scan on fact f  (cost=0.00..14188.98 rows=437498 width=41) (actual time=0.144..131.050 rows=350000 loops=3)
         ->  Hash  (cost=1.64..1.64 rows=1 width=4) (actual time=0.138..0.139 rows=1 loops=3)
               Buckets: 1024  Batches: 1  Memory Usage: 9kB
               ->  Seq Scan on dim d  (cost=0.00..1.64 rows=1 width=4) (actual time=0.099..0.109 rows=1 loops=3)
                     Filter: (name = 'key'::text)
                     Rows Removed by Filter: 50
 Planning Time: 1.059 ms
 Execution Time: 322.662 ms
```

Now, we execute the same query, but instead of filtering through the join, we filter using a scalar subquery:
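As a diagnostic (not something I'd want in production), the planner can be steered away from the hash join for a single transaction with `SET LOCAL`, which at least confirms whether a nested-loop plan over idx_fact_dim is available at all. This is only a sketch of the experiment:

```sql
-- Diagnostic only: SET LOCAL is scoped to the current transaction,
-- so concurrent sessions and other DB loads are unaffected.
BEGIN;
SET LOCAL enable_hashjoin  = off;
SET LOCAL enable_mergejoin = off;

EXPLAIN ANALYZE
SELECT f.*
FROM fact AS f
JOIN dim AS d ON (d.id = f.dim_id)
WHERE d.name = 'key';

ROLLBACK;  -- discard the setting changes
```

If the resulting plan is a nested loop with an index scan on fact, the issue is purely one of cost estimation rather than plan availability.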
```sql
EXPLAIN ANALYZE
SELECT *
FROM fact
WHERE dim_id = (SELECT id FROM dim WHERE name = 'key');
```

```
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using idx_fact_dim on fact  (cost=2.07..15759.53 rows=524998 width=41) (actual time=0.096..0.097 rows=0 loops=1)
   Index Cond: (dim_id = $0)
   InitPlan 1 (returns $0)
     ->  Seq Scan on dim  (cost=0.00..1.64 rows=1 width=4) (actual time=0.046..0.054 rows=1 loops=1)
           Filter: (name = 'key'::text)
           Rows Removed by Filter: 50
 Planning Time: 0.313 ms
 Execution Time: 0.156 ms
```

As shown, the performance difference is huge. Somehow, the query planner did not treat the predicate on the unique dim::name attribute as equivalent to a predicate on fact::dim_id in the first query.
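For completeness, the same lookup can also be phrased as a LATERAL join, which keeps join syntax while making the dependency of the fact scan on the dim row explicit. This is only a sketch; I have not shown its plan here:

```sql
-- Semantically equivalent to the scalar-subquery form for a unique dim::name.
-- The inner subquery references d.id, so each qualifying dim row drives
-- one parameterized lookup into fact.
SELECT f.*
FROM dim AS d
CROSS JOIN LATERAL (
    SELECT *
    FROM fact
    WHERE fact.dim_id = d.id
) AS f
WHERE d.name = 'key';
```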
