select rows that all exist in other table - how to improve performance?

Question

I've three tables in SQL Server 2008 R2:

PRODUCTS (id int, title varchar(100), ....) WORDS (id int,word varchar(100) ) WORDS_IN_TITLES (product_id int, word_id int)

and now I want to select all products that have certain words used in its title.

Now i do it that way:

declare @words tp_intList insert into @words values(154) insert into @words values(172) declare @wordsCnt int = (select count(*) from @words) select * from products where id IN ( select product_id from WORDS_IN_TITLES inner join (select id from @words) wrds ON wrds.id=WORDS_IN_TITLES.word_id group by product_id HAVING count(*)=@wordsCnt )

It works but it's slow. Table contains 600k rows, and returning 3.5k rows takes around 4 seconds. I need it to go much below 1 second. How can I improve the performance?

What are the counts for each of the products and words_in_titles tables? Is 600k in products? — RichardTheKiwi
– RichardTheKiwi, Commented Feb 15, 2011 at 21:02

Joe Stefanelli · Accepted Answer · 2011-02-15 20:56:09Z

select products.* from products inner join (select p.id from products p inner join words_in_titles wit on p.id = wit.product_id where wit.word_id in (154,172) group by p.id having count(distinct wit.word_id) = 2) q on products.id = q.id

In fact timing suggests it is slower, because it brings in Products to the mix too early.

Community · Accepted Answer · 2020-06-20 09:12:55Z

It doesn't look like your query can be improved much.

Below is a sample table constructed with 600k rows for products and nearly 600k rows for words_in_titles. For every 2 word_ids picked randomly, there should be about 3 to 10 products matching the combination.

Create tables and populate with data. Create index also on words_in_titles (word_id)

create table products (id int identity primary key clustered, title varchar(100)) insert into products select convert(varchar(max),NEWID()) from master..spt_values a inner join master..spt_values b on b.type='p' and b.number between 0 and 999 where a.type='P' and a.number between 0 and 600 create table words_in_titles (product_id int, word_id int, primary key clustered(product_id, word_id)) insert words_in_titles select distinct a,b from ( select floor(convert(bigint,convert(varbinary(max),newid())) % 60000) a, floor(convert(bigint,convert(varbinary(max),newid())) % 1000) b from master..spt_values a inner join master..spt_values b on b.type='p' and b.number between 0 and 999 where a.type='P' and a.number between 0 and 600 ) x create index ix_words_in_titles on words_in_titles(word_id)

Next time the different approaches. We use SET STATISTICS to see internal stats. You should also check the execution plan (but not while checking statistics - it pollutes the stats). DBCC commands are used to flush buffers and clear plans, set the @clean bit to 1 to clear between runs, and 0 to emulate during the day runs where data could already be in the buffer.

declare @clean bit set @clean = 1 if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache') set statistics io off set statistics time off -- pick two random word_id's as generated (@word1 and @word2 used below) declare @word1 int, @word2 int select top 1 @word1 = word_id from words_in_titles order by NEWID() select top 1 @word2 = word_id from words_in_titles where word_id <> @word1 order by NEWID() declare @words table (id int) insert into @words values(@word1) insert into @words values(@word2) declare @wordsCnt int = (select count(*) from @words) set statistics io on set statistics time on if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache') select * from ( select w.product_id from words_in_titles w where w.word_id = @word1 and exists (select * from words_in_titles t where t.word_id=@word2 and t.product_id=w.product_id) -- expand with more EXISTS clauses ) q inner join products p on p.id = q.product_id if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache') select * from ( select w1.product_id from words_in_titles w1 where w1.word_id = @word1 intersect select w2.product_id from words_in_titles w2 where w2.word_id = @word2 ) q inner join products p on p.id = q.product_id if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache') select * from products where id IN ( select product_id from WORDS_IN_TITLES inner join (select id from @words) wrds ON wrds.id=WORDS_IN_TITLES.word_id group by product_id HAVING count(*)=@wordsCnt ) if(@clean=1) exec ('dbcc dropcleanbuffers dbcc freeproccache') select products.* from products inner join (select p.id from products p inner join words_in_titles wit on p.id = wit.product_id where wit.word_id in (@word1,@word2) group by p.id having count(distinct wit.word_id) = 2) q on products.id = q.id

The Execution Plans are not shown here, but if you view it, both alternatives and the original query are all quite similar.
The first two queries collect from words_in_titles twice, but each for a separate word_id, then use different strategies for working out the intersection
The original query streams only once from words_in_titles and aggregates instead but requires an additional I/O operation to sort in tempdb.
Joe's query uses a more complex plan.

Statistics: Look for batches that start with Table and the very next SQL Server Execution Times after it. Four such fragments will represent the 4 queries timed.

Table 'products'. Scan count 0, logical reads 30, physical reads 0, read-ahead reads 51, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'words_in_titles'. Scan count 2, logical reads 8, physical reads 2, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#4D5F7D71'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: CPU time = 0 ms, elapsed time = 47 ms.

(a shown above, the original query uses a temporary sort).
If you run it enough times, you will see that the last query always performs slower than the rest, 2nd one is normally faster than the first and the 3rd (original) sometimes comes before or after the 1st and 2nd.

Conclusion

You can try either of the alternatives presented, but there is unlikely to be much improvement to your query.

Collectives™ on Stack Overflow

select rows that all exist in other table - how to improve performance?

2 Answers 2

1 Comment

Conclusion

Comments

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Conclusion

Comments

Related