1

I'm trying to get all users that haven't been sent a message in a given day. I have two tables, one with the messages that have a reporterid indexed column, and a reporter table.

Let me walk you through it:

This inner query returns reporters that have received messages in a given day. It takes tenths of a second on my local dev machine:

select m.reporterid from nubamessage m where m.messagetypeenum =7 and m.createdOn>='2016-06-18 00:00:00' and m.createdOn<='2016-06-18 09:30:00'; 

There is no difference in the execution time of the above query if I do select(distinct) or not, and it returns the same rows.

But if I used that in an outer join query, it basically never completes:

select * from reporter r left outer join ( SELECT m.reporterid from nubamessage m where m.messagetypeenum =7 and m.createdOn>='2016-06-18 00:00:00' and m.createdOn<='2016-06-18 09:30:00' ) as mm ON mm.reporterid=r.id where r.enabled=1 and m.reporterid is null; 

However, if I change the inner query to distinct as the below, it takes a fraction of a second:

select * from reporter r left outer join ( SELECT distinct(m.reporterid) -- the only difference from nubamessage m where m.messagetypeenum =7 and m.createdOn>='2016-06-18 00:00:00' and m.createdOn<='2016-06-18 09:30:00' ) as mm ON mm.reporterid=r.id where r.enabled=1 and m.reporterid is null; 

The execution plan for the non-completing query:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 SIMPLE r NULL ALL NULL NULL NULL NULL 11538 50.00 "Using where" 1 SIMPLE m NULL ALL NULL NULL NULL NULL 968388 100.00 "Using where; Using join buffer (Block Nested Loop)" 

The execution plan for the query where I just add "distinct", it adds some auto-key:

id select_type table partitions type possible_keys key key_len ref rows filtered Extra 1 PRIMARY r NULL ALL NULL NULL NULL NULL 11538 50.00 "Using where" 1 PRIMARY <derived2> NULL ref <auto_key0> <auto_key0> 5 r.id 10 100.00 "Using index" 2 DERIVED m NULL ALL NULL NULL NULL NULL 968388 1.11 "Using where; Using temporary" 

If someone could find it in their heart to explain to me why 1. the first query never completes/is so slow and 2. why the distinct makes it run fast, even though the inner query returns the exact same rows in both cases, I'd be most grateful.

Thanks

Edit: Notice that I write "all users that haven't been sent a message. This is why I have the left join - later in the query I have "where m.reporterid is null", in order to only get out the reporters that don't already have a message in the messagetable. I have edited the questions above to reflect that.

I could have used a "not in" instead but from what I've read and in my tests it was a bit slower than left join.

1
  • edited the sql queries Commented Oct 19, 2016 at 10:33

2 Answers 2

1

LEFT is killing performance; remove it unless you have a good reason for keeping it. To elaborate...

((but first... The purpose for the LEFT was missing from the original query; this answer assumes the LEFT was not necessary.))

LEFT JOIN says that you want data from the right table, whether or not there was a match with the left table.

JOIN says to display only the rows that match (via ON) both tables.

(OUTER is optional and adds no semantics.)

If you are running a version before 5.6, the derived table (subquery in LEFT JOIN) will have no index, so it must be scanned repeatedly. This is a big reason to get rid of LEFT.

Without the LEFT, the Optimizer is likely to evaluate the subquery once, then efficiently JOIN to reporter to finish the query.

For JOIN, the inner query needs this composite (and covering) index INDEX(messagetypeenum, createdOn, reporterid).

There is another technique (I think)... Get rid of the inner SELECT, simply JOIN (or LEFT JOIN) to the table:

select r.* from reporter r left join nubamessage m ON m.reporterid = r.id AND m.messagetypeenum = 7 and m.createdOn >= '2016-06-18 00:00:00' and m.createdOn <= '2016-06-18 09:30:00' WHERE r.enabled = 1 AND m.reporterid is null; 

In this case, it may needs INDEX(reporterid, messagetypeenum, createdOn).

Yet another variant would use EXISTS and, I think, provides the equivalent of the LEFT

select r.* from reporter r WHERE r.enabled=1 AND NOT EXISTS ( SELECT * from nubamessage m where m.messagetypeenum =7 and m.createdOn>='2016-06-18 00:00:00' and m.createdOn<='2016-06-18 09:30:00' AND m.reporterid=r.id ) 

I can't predict which variant will be fastest. It partially depends on how "many" in the many:1 mapping of m.reporterid : r.id.

3
  • The thing is that I want the reporters that are NOT in the messagetable. I notice now that I didn't post the entire question, later on I have a "where m.reporterid is null." I only want the reporters I haven't already sent to. This is why I have the left join. I've added some text to clarify. Commented Oct 19, 2016 at 7:52
  • My two suggested reformulations may still work for you. Commented Oct 19, 2016 at 17:47
  • @ypercubeᵀᴹ - thanks. Matthias - I changed both queries. Commented Oct 19, 2016 at 19:35
0

In most cases LEFT JOIN is the first point for optimizations.

This query does the same things as yours, but it use aggregations instead of LEFT JOIN:

SELECT counters.id FROM ( SELECT r.id, 0 as cnt, 1 as enabl FROM reporter r WHERE r.enabled=1 UNION SELECT m.reporterid as id, count(1) as cnt, 0 as enabl FROM nubamessage m WHERE m.messagetypeenum =7 and m.createdOn>='2016-06-18 00:00:00' and m.createdOn<='2016-06-18 09:30:00' GROUP BY m.reporterid ) AS counters GROUP BY counters.id HAVING sum(counters.cnt)=0 and sum(counters.enabl)>0 

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.