I've got a table users with fields id and email. id is the primary key and email is indexed as well.
database> \d users +-----------------------------+-----------------------------+-----------------------------------------------------+ | Column | Type | Modifiers | |-----------------------------+-----------------------------+-----------------------------------------------------| | id | integer | not null default nextval('users_id_seq'::regclass) | | email | character varying | | +-----------------------------+-----------------------------+-----------------------------------------------------+ Indexes: "users_pkey" PRIMARY KEY, btree (id) "index_users_on_email" UNIQUE, btree (email) If I query the table with a distinct on (email) clause in a subquery I get a significant performance penalty.
database> explain (analyze, buffers) select id from ( select distinct on (email) id from users ) as t where id = 123 +-----------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | |-----------------------------------------------------------------------------------------------------------------------------| | Subquery Scan on t (cost=8898.69..10077.84 rows=337 width=4) (actual time=221.133..250.782 rows=1 loops=1) | | Filter: (t.id = 123) | | Rows Removed by Filter: 67379 | | Buffers: shared hit=2824, temp read=288 written=289 | | -> Unique (cost=8898.69..9235.59 rows=67380 width=24) (actual time=221.121..247.582 rows=67380 loops=1) | | Buffers: shared hit=2824, temp read=288 written=289 | | -> Sort (cost=8898.69..9067.14 rows=67380 width=24) (actual time=221.120..239.573 rows=67380 loops=1) | | Sort Key: users.email | | Sort Method: external merge Disk: 2304kB | | Buffers: shared hit=2824, temp read=288 written=289 | | -> Seq Scan on users (cost=0.00..3494.80 rows=67380 width=24) (actual time=0.009..9.714 rows=67380 loops=1) | | Buffers: shared hit=2821 | | Planning Time: 0.243 ms | | Execution Time: 251.258 ms | +-----------------------------------------------------------------------------------------------------------------------------+ Compare this with distinct on (id) whose cost is less than one thousandth of the previous query.
database> explain (analyze, buffers) select id from ( select distinct on (id) id from users ) as t where id = 123 +-----------------------------------------------------------------------------------------------------------------------------+ | QUERY PLAN | |-----------------------------------------------------------------------------------------------------------------------------| | Unique (cost=0.29..8.31 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1) | | Buffers: shared hit=3 | | -> Index Only Scan using users_pkey on users (cost=0.29..8.31 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1) | | Index Cond: (id = 123) | | Heap Fetches: 1 | | Buffers: shared hit=3 | | Planning Time: 0.090 ms | | Execution Time: 0.034 ms | +-----------------------------------------------------------------------------------------------------------------------------+ Why is this?
The real problem I'm having is that I'm trying to create a view that does distinct on an indexed column that isn't unique and the performance is very bad.
distinctsince you're going to discard the results. In the last case the server was able to detect this since only the uniqueidcolumn is involved and simply discard thatDISTINCToperation. In the previous case it had to calculatedistinct onto find whichIDvalues are returned before it can filter themidless than123for an email. In the second case there's always a result as aDISTINCTon aUNIQUEindex just returns the indexed valuesexplain (analyze, buffers)