4

The following query finishes in 1.5s (which is ok, the table contains about 500M rows):

explain (analyze, buffers) select sales.* from sales join product on (product.id = sales.productid) join date on (date.id = sales.dateid) where product.id = 24 order by date.timestamp limit 200; 

Query plan: http://explain.depesz.com/s/8Ix

Searching for product.name instead increases the runtime to totally unacceptable 200s:

explain (analyze, buffers) select sales.* from sales join product on (product.id = sales.productid) join date on (date.id = sales.dateid) where product.name = 'new00000006' order by date.timestamp limit 200; 

Query plan: http://explain.depesz.com/s/0RfQ

Note that the product named 'new00000006' has id 24 (same id as in the fast query above). Proof:

select name from product where id = 24; name ------------- new00000006 

Why does that query take 200 times longer than the first query?

Another interesting modification of this query is this.. instead of product.id = 24 (like in the first query), I use product.id = (select 24). This also takes 200s to run (it actually results in the same bad query plan as when searching for product.name):

explain (analyze, buffers) select sales.* from sales join product on (product.id = sales.productid) join date on (date.id = sales.dateid) where product.id = (select 24) order by date.timestamp limit 200; 

Query plan: http://explain.depesz.com/s/K3VO

The statistics table shows that product id 24 is "rare":

select most_common_vals from pg_stats where tablename='sales' and attname='productid'; {19,2,7,39,40,14,33,18,8,37,16,48,6,23,49,29,46,41,20,53,47,26,38,1,32,42,56,57,10,15,27,50,30,45,51,58,17,36,4,25,44,43,5,22,11,35,52,9,21,12,24,31,28,54,34,3,55,13} select most_common_freqs from pg_stats where tablename='sales' and attname='productid'; {0.020225,0.020119,0.0201133,0.0201087,0.0201,0.0200903,0.0200843,0.020069,0.0200557,0.0200477,0.0200427,0.0200303,0.0200197,0.020019,0.020012,0.0200107,0.0200067,0.020006,0.019995,0.0199947,0.0199917,0.019986,0.019986,0.0199777,0.0199747,0.0199713,0.0199693,0.019969,0.019967,0.019962,0.0199607,0.0199603,0.01996,0.0199567,0.0199567,0.0199533,0.019952,0.019951,0.0199467,0.019944,0.019944,0.01993,0.0199297,0.0199257,0.0199223,0.0199143,0.01989,0.0198887,0.019883,0.0198747,6.7e-005,6e-005,5.9e-005,5.6e-005,5.46667e-005,5.43333e-005,5.13333e-005,4.96667e-005} 

Product id 24 has a frequency of 6.7e-005 (it's a "new product"), while older products have frequencies of about 0.01.

Statistics show that the first query plan (the one that runs in 1.5s) makes perfect sense. It uses the sales_productid_index to quickly find the sales of this product. Why isn't the same query plan used in the other two cases? It seems like statistics are ignored.

Table definitions (slightly obfuscated / renamed):

 Tabelle äpublic.salesô Spalte | Typ | Attribute | Speicherung | Statistikziel | Beschreibung -----------+---------+-----------+-------------+---------------+-------------- id | uuid | not null | plain | | dateid | integer | | plain | 10000 | productid | integer | | plain | 10000 | a | text | | extended | 10000 | b | integer | | plain | 10000 | x1 | boolean | | plain | | x2 | boolean | | plain | | x3 | boolean | | plain | | x4 | boolean | | plain | | x5 | boolean | | plain | | Indexe: "sales_pkey" PRIMARY KEY, btree (id) "sales_a_index" btree (a) "sales_b_index" btree (b) "sales_dateid_index" btree (dateid) "sales_productid_index" btree (productid) Fremdschlnssel-Constraints: "sales_dateid_fkey" FOREIGN KEY (dateid) REFERENCES date(id) "sales_productid_fkey" FOREIGN KEY (productid) REFERENCES product(id) Hat OIDs: nein Tabelle äpublic.productô Spalte | Typ | Attribute | Speicherung | Statistikziel | Beschreibung --------+---------+----------------------------------------------------------+-------------+---------------+-------------- id | integer | not null Vorgabewert nextval('product_id_seq'::regclass) | plain | | name | text | | extended | | Indexe: "product_pkey" PRIMARY KEY, btree (id) "product_name_index" UNIQUE, btree (name) Fremdschlnsselverweise von: TABLE "sales" CONSTRAINT "sales_productid_fkey" FOREIGN KEY (productid) REFERENCES product(id) TABLE "salesaggr" CONSTRAINT "salesaggr_productid_fkey" FOREIGN KEY (productid) REFERENCES product(id) Hat OIDs: nein 

Version: PostgreSQL 9.3.1, compiled by Visual C++ build 1600, 64-bit

Config: default configuration except for maintenance_work_mem, which has been increased to 1GB.

Operating system: Microsoft Windows [Version 6.2.9200]

Amount and size of RAM installed: 32GB

Storage: a single 1TB SSD

9
  • Excuse me if I'm wrong, but are you sure this isn't the case? stackoverflow.com/questions/6442046/… Commented Dec 2, 2013 at 15:11
  • Is there an index on product.name? Commented Dec 2, 2013 at 15:11
  • @MikeSherrill'Catcall' yes there is a unique btree index on product.name. But I think it doesn't get used since the product table only contains 60 rows. Commented Dec 2, 2013 at 15:12
  • @foibs I'm not sure. His problem seems to be that the estimated numbers of rows were off.. the estimated number of rows seems to fit rather well in my query plans. I have run analyze and cluster, makes no difference. I'll try reindex during night. Note that all tables are "insert only". There are no updates or deletes.. so there shouldn't be much to vacuum. Commented Dec 2, 2013 at 15:28
  • It is simply unacceptable to post such a question without providing your version of Postgres and (relevant) table definitions. Read this: stackoverflow.com/tags/postgresql-performance/info Commented Dec 2, 2013 at 17:09

1 Answer 1

2

In your first query, the planner takes a shortcut and uses the sales_productid_index available on sales.productid since it is told that sales.productid = product.id. The only thing the join with product actually does in this query is assuring that a row with id = 24 actually exists in the table.

In the second query, this shortcut isn't available. The planner could choose to go to product, get the id and then scan sales using the index on productid, probably getting similar performance, but because he doesn't know that name='new00000006' will lead to id=24, he can't guess how many rows in sales this path would result in*. For all he knows he'd be index scanning a significant part of the 300M rows sales table.

*Note that in the first query, he guesses that productid=24 will result in 42393 rows, while getting 34560 rows. Quite close considering the table has 300M rows.

Sign up to request clarification or add additional context in comments.

1 Comment

The explanation makes sense ("but because he doesn't know that name='new00000006' will lead to id=24, he can't guess how many rows in sales this path would result in"). We now make two queries - first getting the product id by name, than using that id in the real query. Gives much better results (i.e. 1.5s instead of 200s as above). But it seems more like a workaround than a solution. The problem seems to be too easy for the optimizer not being able to figure that out himself...

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.