Why does the comparison of value to null return false, except when using a NOT IN, where it returns true?


Given a query to find all stackoverflow users who have a post:

SELECT * FROM Users WHERE UserID IN (SELECT UserID FROM Posts) 

This works as expected; i get a list of all users who have a post.

Now query for the inverse; find all stackoverflow users who don't have a post:

SELECT * FROM Users WHERE UserID NOT IN (SELECT UserID FROM Posts) 

This returns no records, which is incorrect.

Given hypothetical data (see footnote 1):

Users                  Posts
================       ===============================
UserID  Username       PostID   UserID  Subject
------  --------       -------  ------  ----------------
1       atkins         1        1       Welcome to stack ov...
2       joels          2        2       Welcome all!
...     ...            ...      ...
399573  gt6989b        ...      ...     ...
...     ...            10592    null    (deleted by nsl&fbi...
                       ...      ...
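For anyone who wants to reproduce this, here is a minimal sketch that loads a cut-down version of the hypothetical data and runs both queries. It uses Python's sqlite3 module, with SQLite standing in for SQL Server (that substitution is an assumption; both follow the standard NULL rules for IN and NOT IN).

```python
import sqlite3

# SQLite stands in for SQL Server here (assumption); the rows are a
# cut-down version of the hypothetical Users and Posts tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserID INTEGER, Username TEXT);
    CREATE TABLE Posts (PostID INTEGER, UserID INTEGER, Subject TEXT);
    INSERT INTO Users VALUES (1, 'atkins'), (2, 'joels'), (399573, 'gt6989b');
    INSERT INTO Posts VALUES
        (1, 1, 'Welcome to stack ov...'),
        (2, 2, 'Welcome all!'),
        (10592, NULL, '(deleted by nsl&fbi...');
""")

with_posts = conn.execute(
    "SELECT UserID FROM Users "
    "WHERE UserID IN (SELECT UserID FROM Posts) ORDER BY UserID"
).fetchall()
without_posts = conn.execute(
    "SELECT UserID FROM Users "
    "WHERE UserID NOT IN (SELECT UserID FROM Posts)"
).fetchall()

print(with_posts)     # [(1,), (2,)]
print(without_posts)  # [] because Posts.UserID contains a NULL
```

User 399573 has no post, yet the second query comes back empty as soon as a single NULL is present in Posts.UserID.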

And assume the rules of NULLs:

  • NULL = NULL evaluates to unknown
  • NULL <> NULL evaluates to unknown
  • value = NULL evaluates to unknown
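These three rules can be checked directly. A small sketch, again assuming SQLite (via Python's sqlite3) as a stand-in for SQL Server; unknown comes back as SQL NULL, which Python shows as None:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each comparison yields SQL NULL (unknown); sqlite3 returns it as None.
comparisons = conn.execute(
    "SELECT NULL = NULL, NULL <> NULL, 1 = NULL"
).fetchone()
print(comparisons)  # (None, None, None)

# A WHERE clause keeps a row only when the predicate is true,
# so an unknown predicate keeps nothing.
kept = conn.execute("SELECT 'kept' WHERE 1 = NULL").fetchall()
print(kept)  # []
```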

If we look at the 2nd query, we're interested in finding all rows where the Users.UserID is not found in the Posts.UserID column. i would proceed logically as follows:

Check UserID 1

  • 1 = 1 returns true. So we conclude that this user has some posts, and do not include them in the output list

Now check UserID 2:

  • 2 = 1 returns false, so we keep looking
  • 2 = 2 returns true, so we conclude that this user has some posts, and do not include them in the output list

Now check UserID 399573

  • 399573 = 1 returns false, so we keep looking
  • 399573 = 2 returns false, so we keep looking
  • ...
  • 399573 = null returns unknown, so we keep looking
  • ...

We found no posts by UserID 399573, so we would include him in the output list.

Except SQL Server doesn't do this. If you have a NULL in your IN list, then suddenly it finds a match. Suddenly 399573 = null evaluates to true.

Why does the comparison of value to null return unknown, except when it returns true?

Edit: i know that i can work around this nonsensical behavior by specifically excluding the nulls:

SELECT * FROM Users WHERE UserID NOT IN ( SELECT UserID FROM Posts WHERE UserID IS NOT NULL) 

But i shouldn't have to, as far as i can tell the boolean logic should be fine without it - hence my question.
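A quick sketch of the workaround against the same cut-down hypothetical data, assuming SQLite (via Python's sqlite3) in place of SQL Server:

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); data is the cut-down
# hypothetical Users/Posts set, including one post with a NULL UserID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Users (UserID INTEGER, Username TEXT);
    CREATE TABLE Posts (PostID INTEGER, UserID INTEGER, Subject TEXT);
    INSERT INTO Users VALUES (1, 'atkins'), (2, 'joels'), (399573, 'gt6989b');
    INSERT INTO Posts VALUES (1, 1, 'a'), (2, 2, 'b'), (10592, NULL, 'c');
""")

# Filtering the NULLs out of the subquery removes the unknown comparison.
without_posts = conn.execute(
    "SELECT UserID FROM Users WHERE UserID NOT IN "
    "(SELECT UserID FROM Posts WHERE UserID IS NOT NULL)"
).fetchall()
print(without_posts)  # [(399573,)]
```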

Footnotes

  • 1 hypothetical data; if you don't like it: make up your own.
  • celko now has his own tag

Comments

  • Duplicate: See stackoverflow.com/questions/129077/… Commented Oct 13, 2010 at 14:30
  • @Ian Boyd = I don't think that's a bug. Something else to remember is IN uses or and NOT IN uses AND. When you are evaluating against a NULL with an inequality vs. a known value, you will always get a false since there is no way to know if it matches or not. Commented Oct 13, 2010 at 14:49
  • @Ian Boyd: I wouldn't call this a bug. IN is nothing more than a convenient shorthand for a series of OR clauses. Technically, the correct way to think of the expansion given in the other question is: select 'true' where NOT(3 = 1 or 3 = 2 or 3 = null), which is logically equivalent by DeMorgan's Law. In any case, the fallacy is assuming that a comparison of a value=NULL returns FALSE when in fact the result is UNKNOWN. Commented Oct 13, 2010 at 14:52
  • Yep. Not a bug. Also it is not boolean logic. It is 3 valued logic. Commented Oct 13, 2010 at 15:17
  • @Ian Boyd - For your sister's name issue, that's a data modelling issue. You should have a subtable SISTER that refers back to your main table. If you don't have a sister you wouldn't be included in the result set. Also, NULL means unknown. If you have 2 people whose names you don't know, can you say one of their names isn't Kirsten? Commented Oct 16, 2010 at 18:21

2 Answers

Common problem, canned answer:

The behavior of NOT IN clause may be confusing and as such it needs some explanations. Consider the following query:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL) 

Although there are more than a thousand distinct last names in AdventureWorks.Person.Contact, the query returns nothing. This may look counterintuitive to a beginner database programmer, but it actually makes perfect sense. The explanation consists of several simple steps. First of all, consider the following two queries, which are clearly equivalent:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName IN('Hedlund', 'Holloway', NULL)

SELECT LastName, FirstName FROM Person.Contact WHERE LastName='Hedlund' OR LastName='Holloway' OR LastName=NULL

Note that both queries return expected results. Now, let us recall DeMorgan's theorem, which states that:

not (P and Q) = (not P) or (not Q)
not (P or Q) = (not P) and (not Q)

I am cutting and pasting from Wikipedia (http://en.wikipedia.org/wiki/De_Morgan_duality). Applying DeMorgan's theorem to these queries, it follows that these two queries are also equivalent:

SELECT LastName, FirstName FROM Person.Contact WHERE LastName NOT IN('Hedlund', 'Holloway', NULL)

SELECT LastName, FirstName FROM Person.Contact WHERE LastName<>'Hedlund' AND LastName<>'Holloway' AND LastName<>NULL

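This last LastName<>NULL can never be true. It always evaluates to unknown, so the whole AND chain is at best unknown, and the WHERE clause rejects every row.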
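A small sketch of that equivalence. The Contact table below is a hypothetical three-row stand-in for AdventureWorks.Person.Contact, and SQLite (via Python's sqlite3) is assumed in place of SQL Server:

```python
import sqlite3

# Hypothetical three-row stand-in for Person.Contact (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contact (LastName TEXT);
    INSERT INTO Contact VALUES ('Hedlund'), ('Holloway'), ('Smith');
""")

not_in = conn.execute(
    "SELECT LastName FROM Contact "
    "WHERE LastName NOT IN ('Hedlund', 'Holloway', NULL)"
).fetchall()
expanded = conn.execute(
    "SELECT LastName FROM Contact "
    "WHERE LastName <> 'Hedlund' AND LastName <> 'Holloway' "
    "AND LastName <> NULL"
).fetchall()

# The predicate value per row: 0 is false, None is unknown, never 1 (true).
per_row = conn.execute(
    "SELECT LastName, "
    "LastName <> 'Hedlund' AND LastName <> 'Holloway' AND LastName <> NULL "
    "FROM Contact ORDER BY LastName"
).fetchall()

print(not_in)    # []
print(expanded)  # []
print(per_row)   # [('Hedlund', 0), ('Holloway', 0), ('Smith', None)]
```

'Smith' is the interesting row: both real comparisons are true, but the trailing comparison with NULL leaves the predicate unknown rather than true.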

4 Comments

DeMorgan's theorem applies to boolean logic, not to 3-valued logic. For example: not (unknown or true) = false, but (not unknown) and (not true) = unknown, violating the theorem
My confusion comes from the second "equivalent" set of queries. i would argue that they're not equivalent - since they return different answers. But my argument is irrelevant, since the ANSI guys defined the first is an alias of the 2nd.
Accepted this answer, as it points out NOT IN is defined to be an alias for (x<>a) AND (x<>b) AND (x<>c) ....
@Andomar - Actually, I believe it goes: (not unknown) and (not true) = unknown and false = false, which is the same result as: not (unknown or true) = not (true) = false. Similarly, not (unknown or false) = not (unknown) = unknown, which is the same result as: (not unknown) and (not false) = unknown and true = unknown. So the theorem holds for 3-valued logic, given the axioms: unknown or true = true; unknown and false = false.
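The disagreement in these comments can be settled by brute force. Below is a sketch that models three-valued logic in Python, with None standing in for unknown, and checks both of De Morgan's laws over all nine input pairs. The helper names not3, and3 and or3 are made up for this illustration.

```python
from itertools import product

# Three-valued logic as SQL uses it; None stands in for unknown.
def not3(p):
    return None if p is None else not p

def and3(p, q):
    if p is False or q is False:
        return False          # false dominates AND
    if p is None or q is None:
        return None
    return True

def or3(p, q):
    if p is True or q is True:
        return True           # true dominates OR
    if p is None or q is None:
        return None
    return False

values = (True, False, None)
de_morgan_holds = all(
    not3(or3(p, q)) == and3(not3(p), not3(q))
    and not3(and3(p, q)) == or3(not3(p), not3(q))
    for p, q in product(values, repeat=2)
)
print(de_morgan_holds)  # True

# The disputed case: both sides come out false.
print(not3(or3(None, True)), and3(not3(None), not3(True)))  # False False
```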

The assumption in your first sentence isn't right:

Why does the comparison of value to null return false, except when using a NOT IN, where it returns true?

But comparison of a value to null does not return false; it returns unknown. And unknown has its own logic:

unknown AND true = unknown
unknown OR true = true
unknown OR false = unknown

One example of how this works out:

where 1 not in (2, null)
--> where 1 <> 2 and 1 <> null
--> where true and unknown
--> where unknown

The where clause only matches on true, so this filters out every row.
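The same evaluation can be run as a quick check. This sketch assumes SQLite (via Python's sqlite3) as a stand-in for SQL Server; unknown shows up as None and true as 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# unknown AND true, unknown OR true, unknown OR false, 1 NOT IN (2, NULL)
truth = conn.execute(
    "SELECT NULL AND 1, NULL OR 1, NULL OR 0, 1 NOT IN (2, NULL)"
).fetchone()
print(truth)  # (None, 1, None, None)

# Unknown is not true, so the row is filtered out...
with_null = conn.execute("SELECT 'match' WHERE 1 NOT IN (2, NULL)").fetchall()
# ...while the same list without the NULL lets it through.
without_null = conn.execute("SELECT 'match' WHERE 1 NOT IN (2, 3)").fetchall()
print(with_null)     # []
print(without_null)  # [('match',)]
```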

You can find the full glory of 3 value logic at Wikipedia.

2 Comments

i changed all references involving AND null to evaluate to unknown. Question still stands.
I +1'd the other answer for the detail. But +1 to this too, for mentioning 3 valued logic
