4

Given the following table structure:

CREATE TABLE user ( uid INT(11) auto_increment, name VARCHAR(200), PRIMARY KEY(uid) ); CREATE TABLE user_profile( uid INT(11), address VARCHAR(200), PRIMARY KEY(uid), INDEX(address) ); 

Which join query is more efficient: #1,

SELECT u.name FROM user u INNER JOIN user_profile p ON u.uid = p.uid WHERE p.address = 'some constant' 

or #2:

SELECT u.name FROM user u INNER JOIN (SELECT uid FROM user_profile WHERE p.address = 'some constant') p ON u.uid = p.uid 

How much is the difference in efficiency?

2
  • I'd be curious to know the difference between #2 and this query, or, if there would be any difference: SELECT name FROM user WHERE uid IN (SELECT uid FROM user_profile WHERE address = 'some constant'); Commented Nov 12, 2009 at 22:31
  • 2
    EXPLAIN both? I would guess the former as I've never seen great performance from sub-selects in MySQL. Commented Nov 12, 2009 at 22:31

4 Answers 4

7

The first syntax is generally more efficient.

MySQL buffers the derived queries so using the derived query robs the user_profile of possibility to be a driven table in the join.

Even if the user_profile is leading, the subquery results should be buffered first which implies a memory and performance impact.

A LIMIT applied to the queries will make the first query much faster which is not true for the second one.

Here are the sample plans. There is an index on (val, nid) in the table t_source:

First query:

EXPLAIN SELECT * FROM t_source s1 JOIN t_source s2 ON s2.nid = s1.id WHERE s2.val = 1 1, 'SIMPLE', 's1', 'ALL', 'PRIMARY', '', '', '', 1000000, '' 1, 'SIMPLE', 's2', 'ref', 'ix_source_val,ix_source_val_nid,ix_source_vald_nid', 'ix_source_val_nid', '8', 'const,test.s1.id', 1, 'Using where' 

Second query:

EXPLAIN SELECT * FROM t_source s1 JOIN ( SELECT nid FROM t_source s2 WHERE val = 1 ) q ON q.nid = s1.id 1, 'PRIMARY', '<derived2>', 'ALL', '', '', '', '', 100000, '' 1, 'PRIMARY', 's1', 'ref', 'PRIMARY', 'PRIMARY', '4', 'q.nid', 10000, 'Using where' 2, 'DERIVED', 's2', 'ref', 'ix_source_val,ix_source_val_nid,ix_source_vald_nid', 'ix_source_vald_nid', '4', '', 91324, 'Using index' 

As you can see, only a part of the index is used in the second case, and q is forced to be leading.

Update:

Derived queries (which is what this question concerns) are not to be confused with the subqueries.

While MySQL is not able to optimize derived queries (those used in the FROM clause), the subqueries (those used with IN or EXISTS) are treated much better.

See these articles in my blog for more detail:

Sign up to request clarification or add additional context in comments.

1 Comment

+1 for a detailed answer, and thank you for the heads-up on the ridiculous failings of the MySQL optimizer; I've had issues with it before, but nothing quite that poor.
1

Looking at the explain queries for these selects, we get this: (row headers are id, select_type, table, type, possible_keys, key, key_len, ref, rows, extra)

1 SIMPLE u system PRIMARY NULL NULL NULL 1 1 SIMPLE p const PRIMARY,address PRIMARY 4 const 1 

And the EXPLAIN for the second...

1 PRIMARY u system PRIMARY NULL NULL NULL 1 1 PRIMARY <derived2> system NULL NULL NULL NULL 1 2 DERIVED p ref address address 201 1 Using where 

So, the first query is simpler, and simpler is usually more efficient.

However, from your CREATEs, it would be vastly more efficient to add the address field to the user table. Since profile is 1-to-1 with the user table (on uid), it is possible to combine the tables and still keep the schema normalized.

Then, your query would be

SELECT u.name FROM user u WHERE u.address = 'some constant' 

and the explain shows

1 SIMPLE u ref address address 201 const 1 Using where, using filesort 

Oddly, the simplified schema uses filesorting, which is bad if you have lots of rows.

More on explain: http://dev.mysql.com/doc/refman/5.0/en/explain.html

5 Comments

I've never seen the filesort in such a query in MySQL, but in other systems this is used to speedup the index lookup. It takes the row pointers from the index and sorts them so that the row lookups will be sequential. The benefits of sequential access to the table outweight the cost to sort the index results.
I was pretty surprised that it did a filesort as well, but it can be a performance hog if you have a large table. If you have 1000 results or so, it'll take some time to write that out to a file, sort the rows in the file, and then proceed. Interesting that it's a optimization however. I'm not very familiar with the internals of DBs.
In this case filesort does not write to a file. It can write to a file indeed (which would be reflected in the plan as using temporary), but it does not in this case. Filesort is improperly named, it should be just a sort.
I should say though that I'm quite surprised to see it here anyway, since this is a const scan and the row pointers are already ordered within a single index key. More than that, it should have been optimized out since you only have 1 row in your table (if your plan is correct and the statistics are up to date). Which version of MySQL are you using?
I actually had two rows at that point. Apparently I have 4.1.20 on my localhost. Man... I need to upgrade. Thanks for the details about filesort, it's good to know.
0

Not sure how the MySQL's query engine would handle that, but my assumption would be the first query would perform better and be more efficient.

The first query is also more standard and the easier to read of the two therefore more preferable.

Comments

0

The answer usually depends on the statistics gathered by database. The first form seems to be easier for optimizer.

As far as I remember, MySQL doesn't work well with IN... queries and subselects

1 Comment

The first form is better indeed, but IN is a special case in MySQL and it is optimized quite well. You may want to read this article: explainextended.com/2009/09/18/…

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.