Most efficient way to SELECT rows WHERE the ID EXISTS IN a second table

Question

I'm looking to select all records from one table where the ID exists in a second table.

The following two queries return the correct results:

Query 1:

SELECT * FROM Table1 t1 WHERE EXISTS (SELECT 1 FROM Table2 t2 WHERE t1.ID = t2.ID)

Query 2:

SELECT * FROM Table1 t1 WHERE t1.ID IN (SELECT t2.ID FROM Table2 t2)

Is one of these queries more efficient than the other? Should I use one over the other? Is there a third method I haven't thought of that is even more efficient?

@MarcB A join will also return columns from table2 as well as multiple rows when there are multiple matches. It is not interchangeable with EXISTS or IN.
– D Stanley, Commented Jul 18, 2016 at 19:45
@TotZam semi join execution plan operator. Though this isn't guaranteed. Sometimes it will remove the duplicates and do an inner join instead, but I wouldn't expect whether you use IN or EXISTS to make any difference to that choice. NB: for NOT IN and NOT EXISTS this is not true, as they can have different semantics and performance.
– Martin Smith, Commented Jul 18, 2016 at 19:55
Here is a great article on the topic: sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in
– Sean Lange, Commented Jul 18, 2016 at 19:59
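
To illustrate the remark above that NOT IN and NOT EXISTS can have different semantics, here is a minimal sketch. The table variables and their contents are made up for this illustration and are not from the thread; the point is only that a NULL in the subquery result changes what NOT IN returns.

```sql
-- Hypothetical sample data (illustrative only, not from the question).
DECLARE @Table1 TABLE (ID int NULL);
DECLARE @Table2 TABLE (ID int NULL);

INSERT INTO @Table1 (ID) VALUES (1), (2), (3);
INSERT INTO @Table2 (ID) VALUES (1), (NULL);

-- NOT IN: returns no rows. For ID 2 and 3 the comparison against NULL
-- evaluates to UNKNOWN, so the predicate is never TRUE.
SELECT t1.ID
FROM @Table1 t1
WHERE t1.ID NOT IN (SELECT t2.ID FROM @Table2 t2);

-- NOT EXISTS: returns 2 and 3, because the NULL row simply never matches.
SELECT t1.ID
FROM @Table1 t1
WHERE NOT EXISTS (SELECT 1 FROM @Table2 t2 WHERE t2.ID = t1.ID);
```

The positive forms (IN and EXISTS) used in the question return the same rows on this data; the difference only appears once the predicate is negated.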

Community · Accepted Answer · 2020-06-20 09:12:55Z

Summary:

IN and EXISTS performed similarly in all scenarios. Below are the parameters used to validate this.

Execution cost, time:
Same for both; the optimizer produced the same plan.
Memory grant:
Same for both queries.
CPU time, logical reads:
EXISTS seems to outperform IN by a small margin in CPU time, though the reads are the same.

I ran each query 10 times using the test data sets below:

A very large subquery result set (100000 rows)
Duplicate rows
Null rows

In all of the above scenarios, IN and EXISTS performed identically.

Some information about the Performance V3 database used for testing: 20,000 customers with 1,000,000 orders, so each customer is randomly duplicated (between 10 and 100 times) in the Orders table.

Execution cost, time:
Below is a screenshot of both queries running. Observe each query's relative cost.

Memory cost:
The memory grant for the two queries is also the same. I forced MAXDOP 1 so that they would not spill to tempdb.
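
The answer does not show how the degree of parallelism was forced. One common way in SQL Server is a query-level hint, sketched below against the Customers and Orders tables of the test database described above (the tables themselves are assumed to exist; the choice of a query hint over a server-wide setting is an assumption made for this sketch).

```sql
-- Sketch: limit this one query to a single scheduler with a query hint.
-- Assumes the Customers and Orders tables from the test database.
SELECT *
FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.custid = c.custid)
OPTION (MAXDOP 1);

-- Same hint applied to the IN form so both run under the same conditions.
SELECT *
FROM Customers c
WHERE c.custid IN (SELECT o.custid FROM Orders o)
OPTION (MAXDOP 1);
```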

CPU time, reads:

For Exists:

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 3855, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
   CPU time = 469 ms, elapsed time = 595 ms.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

For IN:

(20000 row(s) affected)
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 1, logical reads 109, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 1, logical reads 3855, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1 row(s) affected)

SQL Server Execution Times:
   CPU time = 547 ms, elapsed time = 669 ms.
SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.
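
Output in the format shown above is what SQL Server prints to the Messages tab when the STATISTICS IO and STATISTICS TIME session options are on. A minimal sketch for reproducing this kind of comparison, assuming the Customers and Orders tables from the test database:

```sql
-- Report per-table reads and CPU/elapsed time for each statement.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- EXISTS form
SELECT *
FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.custid = c.custid);

-- IN form
SELECT *
FROM Customers c
WHERE c.custid IN (SELECT o.custid FROM Orders o);

-- Turn the options off again for the rest of the session.
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

CPU and elapsed times vary from run to run, so a single execution is a weak basis for comparison; the answer above ran each query 10 times.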

In each case, the optimizer is smart enough to rearrange the queries.

I tend to use EXISTS only, though (my opinion). One reason to use EXISTS is when you don't want to return a result set from the second table.

Update as per queries from Martin Smith:

I ran the below queries to find the most effective way to get rows from the first table for which a reference exists in the second table.

SELECT DISTINCT c.* FROM Customers c JOIN Orders o ON o.custid = c.custid

SELECT c.* FROM Customers c INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

SELECT * FROM Customers c WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.custid = c.custid)

SELECT * FROM Customers c WHERE custid IN (SELECT custid FROM Orders)

All of the above queries share the same cost except the second one (the INNER JOIN against the DISTINCT subquery); the plan is the same for the rest.

Memory Grant:
This query

SELECT DISTINCT c.* FROM Customers c JOIN Orders o ON o.custid = c.custid

required memory grant of

This query

SELECT c.* FROM Customers c INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

required memory grant of ..

CPU time, reads:
For query:

SELECT DISTINCT c.* FROM Customers c JOIN Orders o ON o.custid = c.custid

(20000 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 48, logical reads 1344, physical reads 96, read-ahead reads 1248, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 5, logical reads 3929, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Customers'. Scan count 5, logical reads 322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 1453 ms, elapsed time = 781 ms.

For query:

SELECT c.* FROM Customers c INNER JOIN (SELECT DISTINCT custid FROM Orders) AS o ON o.custid = c.custid

(20000 row(s) affected)
Table 'Customers'. Scan count 5, logical reads 322, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Orders'. Scan count 5, logical reads 3929, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 1499 ms, elapsed time = 403 ms.

A couple more that you might consider adding (I would expect to be worse) SELECT DISTINCT c.* FROM Customers c JOIN orders o ON o.custid = c.custid and SELECT c.* FROM Customers c INNER JOIN (SELECT DISTINCT custid FROM orders) AS o ON o.custid = c.custid
Actually I just downloaded that DB and tested as well and it manages to convert the first of those additional DISTINCT queries to a semi join too.
Thank you, Martin. I was about to post the results: all of them, with the exception of the 2nd inner join, shared the same cost.
Something I didn't mention in my original question, yet I think would be useful to note, DISTINCT does not work nicely with fields that are 'text' data type and so even if the JOIN solutions are equivalent in efficiency, they might not work as a replacement.

D Stanley · Accepted Answer · 2016-07-18 19:50:03Z

They mean the exact same thing in pure SQL syntax, so which is more "efficient" depends on how the compiler builds the plan, and how the engine fetches the data. The only way to know for sure is to run it both ways and compare the results.

Even then, the difference is only applicable in that context, meaning there may be other situations in which the other is faster.

I use similar EXISTS and IN queries all the time, so I was looking more to see if there is a general rule for which method is best, not for one specific case.
@TotZam Not really - some engines may turn one or the other into a better plan, but from a general SQL perspective there's not a difference in your example.

paparazzo · Accepted Answer · 2016-07-18 19:56:25Z

In general they perform the same, but when one has been better, EXISTS has won (for me).

If t2.ID is unique then you can just do a join:

SELECT t1.* FROM Table1 t1 JOIN Table2 t2 ON t1.ID = t2.ID

Even if t2.ID is not unique, if you only want unique rows then you could use DISTINCT:

SELECT DISTINCT t1.* FROM Table1 t1 JOIN Table2 t2 ON t1.ID = t2.ID
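
As a rough illustration of why the plain join is only a substitute when the second table has at most one match per ID, here is a small sketch. The table variables and sample rows are hypothetical and stand in for Table1 and Table2 from the question.

```sql
-- Hypothetical sample data: ID 1 appears twice in the second table.
DECLARE @Table1 TABLE (ID int PRIMARY KEY, Name varchar(20));
DECLARE @Table2 TABLE (ID int);

INSERT INTO @Table1 (ID, Name) VALUES (1, 'a'), (2, 'b');
INSERT INTO @Table2 (ID) VALUES (1), (1), (3);

-- Plain join: the row for ID 1 comes back twice, once per match.
SELECT t1.* FROM @Table1 t1 JOIN @Table2 t2 ON t1.ID = t2.ID;

-- Join plus DISTINCT: the row for ID 1 comes back once.
SELECT DISTINCT t1.* FROM @Table1 t1 JOIN @Table2 t2 ON t1.ID = t2.ID;

-- EXISTS: the row for ID 1 comes back once, with no de-duplication step
-- written into the query.
SELECT t1.* FROM @Table1 t1
WHERE EXISTS (SELECT 1 FROM @Table2 t2 WHERE t2.ID = t1.ID);
```

Whether the DISTINCT form costs more than EXISTS depends on the plan the optimizer picks for the actual data, so this sketch shows only the difference in results, not in performance.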

Tot Zam Over a year ago

Is a JOIN plus DISTINCT really more efficient than EXISTS or IN?

paparazzo Over a year ago

@TotZam Why don't you test it on YOUR data.

Tot Zam Over a year ago

Something I didn't mention in my original question, yet I think would be useful to note, DISTINCT does not work nicely with fields that are 'text' data type and so even if this solution is equivalent in efficiency, this would be another reason why a JOIN might not be a replacement.

paparazzo Over a year ago

@TotZam For the second time. Test on YOUR data.

Tot Zam Over a year ago

I use similar EXISTS and IN queries all the time, so I was looking more to see if there is a general rule for which method is best, not for one specific case. I did do some testing and so far have not seen a difference, and wanted to know if this will always be the case.
