Background:
I have a query that I'm trying to speed up, involving 3 tables:
omgenvelope has 5 million rows
omgcust has 195 rows
omginput has 35836 rows
I'm obtaining the distinct omginputs referenced by certain omgenvelopes, and grabbing some data from the omgcust that each omginput belongs to.
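For reference, a rough sketch of the relevant columns and indexes (column and index names match the queries and the EXPLAIN output below; the types are approximations, and the omgenvelope primary key name is a placeholder):

-- Rough schema sketch; index names come from the EXPLAIN output,
-- column types are approximations.
CREATE TABLE omgcust (
  idomgcust INT NOT NULL PRIMARY KEY,
  custname  VARCHAR(100)
);

CREATE TABLE omginput (
  idomginput INT NOT NULL PRIMARY KEY,
  customer   INT NOT NULL,                 -- references omgcust.idomgcust
  filename   VARCHAR(255),
  laststamp  TIMESTAMP,
  KEY fk_omginput_omgcust1_idx (customer)
);

CREATE TABLE omgenvelope (
  idomgenvelope INT NOT NULL PRIMARY KEY,  -- PK name is a placeholder
  lastinput     INT,                       -- references omginput.idomginput
  envstate      INT,
  KEY fk_omgenvelope_omginput1_idx (lastinput),
  KEY fk_omgenvelope_omgstate1_idx (envstate)
);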
My two query variants so far:
SELECT customer, custname, idomginput, filename, omginput.laststamp
FROM omginput, omgcust
WHERE idomginput IN (SELECT DISTINCT(lastinput) FROM omgenvelope
                     WHERE envstate NOT IN (42,46,65,70,250))
  AND idomgcust=customer
ORDER BY omginput.laststamp, filename;
vs.
SELECT DISTINCT customer, custname, idomginput, filename, omginput.laststamp
FROM omgenvelope
JOIN omginput ON (idomginput=lastinput AND envstate NOT IN (42,46,65,70,250))
JOIN omgcust ON (idomgcust=customer)
ORDER BY omginput.laststamp, filename;
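(For raw wall-clock comparisons I can run each variant with SQL_NO_CACHE so the query cache doesn't skew repeated runs, e.g. for the first variant:)

-- SQL_NO_CACHE makes MySQL skip the query cache, so repeated runs
-- measure the actual execution work.
SELECT SQL_NO_CACHE customer, custname, idomginput, filename, omginput.laststamp
FROM omginput, omgcust
WHERE idomginput IN (SELECT DISTINCT(lastinput) FROM omgenvelope
                     WHERE envstate NOT IN (42,46,65,70,250))
  AND idomgcust=customer
ORDER BY omginput.laststamp, filename;

But timing alone doesn't tell me why one plan is better, hence the EXPLAIN comparison below.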
Query plans:
EXPLAIN on the first query, the one with the IN (SELECT ...) subquery, shows this plan (abbreviated):
  select_type: PRIMARY
        table: omgcust
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 195
        Extra: Using temporary; Using filesort
--------------
  select_type: PRIMARY
        table: omginput
         type: ref
          key: fk_omginput_omgcust1_idx
      key_len: 4
          ref: tracksdb.omgcust.idomgcust
         rows: 109
        Extra: Using where
--------------
  select_type: DEPENDENT SUBQUERY
        table: omgenvelope
         type: index_subquery
          key: fk_omgenvelope_omginput1_idx
      key_len: 4
          ref: func
         rows: 867
        Extra: Using where
And here is the plan for the new query that I'm considering using instead:
  select_type: SIMPLE
        table: omgenvelope
         type: range
          key: fk_omgenvelope_omgstate1_idx
      key_len: 4
          ref: NULL
         rows: 886220
        Extra: Using where; Using temporary; Using filesort
--------------
  select_type: SIMPLE
        table: omginput
         type: eq_ref
          key: PRIMARY
      key_len: 4
          ref: tracksdb.omgenvelope.lastinput
         rows: 1
--------------
  select_type: SIMPLE
        table: omgcust
         type: eq_ref
          key: PRIMARY
      key_len: 4
          ref: tracksdb.omginput.customer
         rows: 1
Question:
How do I quantify which one is more efficient? The old one with the DEPENDENT SUBQUERY looks pretty simple and doesn't show that many rows at each step (195 * 109 * 867). The new candidate, on the other hand, shows 886220 * 1 * 1 rows.
So my question is how to interpret these estimated row counts. Can I just multiply them and compare the products, or do I need to think more about what the RDBMS is actually doing when it executes each query?
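In case it matters for an answer: I assume I could also compare actual row accesses, rather than estimates, via the session handler counters, along these lines:

-- Reset the session status counters (needs the RELOAD privilege),
-- run one candidate query, then check how many rows it really touched.
FLUSH STATUS;
SELECT DISTINCT customer, custname, idomginput, filename, omginput.laststamp
FROM omgenvelope
JOIN omginput ON (idomginput=lastinput AND envstate NOT IN (42,46,65,70,250))
JOIN omgcust ON (idomgcust=customer)
ORDER BY omginput.laststamp, filename;
SHOW SESSION STATUS LIKE 'Handler_read%';

But even with such measurements, I'd still like to know what the rows estimates themselves mean.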