NULL and UNIQUE

When I worked for MySQL I saw frequent complaints that the UNIQUE constraint didn’t stop users from inserting NULLs, multiple times. For example:
CREATE TABLE t (s1 INT, UNIQUE (s1));
INSERT INTO t VALUES (NULL);
INSERT INTO t VALUES (NULL); /* This does not cause a “unique constraint violation” error */

There are now eleven generously-commented bugs.mysql.com reports:
#5685, #6829, #7479, #8173, #9844, #17825, #19377, #25544, #27019, #27376, #66512. The essential points (along with observations about how we were stupid or deaf or lazy) are:

  • the ANSI/ISO SQL standard said we should throw an error,
  • all the other major DBMSs would throw an error,
  • and throwing an error would be more sensible and convenient.

The first point is false; the second point depends what “major” means; the third point is a valid opinion.

I will now attempt to beat this subject to death with enough thoroughness that it will become dead.

Exactly what the ANSI/ISO standard says, and how it applies

Our example will be a table named t created thus:
CREATE TABLE t (s1 INT, UNIQUE (s1));
INSERT INTO t VALUES (NULL),(NULL);

So t looks like this:

+------------+
| s1         |
+------------+
| NULL       |
| NULL       |
+------------+

Our question will be: is the constraint “UNIQUE (s1)” violated according to the SQL standard?

I’ll use a recent draft of SQL:2011 because it’s current and it’s available for all to see, at:
jtc1sc32.org/doc/N1951-2000/32N1964T-text_for_ballot-FCD_9075-2.pdf.
The wording is about the same in all earlier versions of the standard since SQL-89, which didn’t allow NULLs at all for UNIQUE constraints.

Section 11.7 says:

[Syntax Rules]
4) … let SC be the <search condition>:
UNIQUE ( SELECT UCL FROM TNN )

where UCL means “Unique Column List” and TNN means “Table Name”.
So for our example, search condition SC is “UNIQUE (SELECT s1 FROM t)”.
[General Rules]
3) The unique constraint is not satisfied if and only if
EXISTS ( SELECT * FROM TNN WHERE NOT ( SC ) )
is True.

… Plugging in our values, that means

3) The unique constraint is not satisfied if and only if
EXISTS ( SELECT * FROM t WHERE NOT ( UNIQUE (SELECT s1 FROM t) ) )
is True.

So now we have to know how “UNIQUE(…)” is supposed to work.
Section 8.11 says:

If there are no two rows in T such that the value of each column in one row is non-null and is not distinct from the value of the corresponding column in the other row, then the result of the <unique predicate> is True; otherwise, the result of the <unique predicate> is False.

… where T is the table.

Now, apply the predicate “UNIQUE (s1)” to table t.
For either Row#1 or Row#2: Is the value of column s1 non-null?
Answer: NO. It is NULL.
Therefore the requirement “that the value of each column in one row is non-null” is met for zero rows in t.

But the <unique predicate> is only False if the requirement is met for two rows in t. Therefore the result of the <unique predicate> is True.

In other words,
UNIQUE (s1) is True.

Therefore
NOT (UNIQUE (s1)) is False.

Therefore
SELECT * FROM TNN WHERE NOT (UNIQUE (t)) returns zero rows.

Therefore
EXISTS ( SELECT * FROM TNN WHERE NOT ( UNIQUE (t) ) ) is false.

But Rule 3) says that the unique constraint is not satisfied if and only if that EXISTS clause is True.
Therefore the constraint is satisfied.
And when a constraint is satisfied, it is not violated.

Thus, according to the rules for <unique constraint definition>, our example table violates no constraint.

It’s tedious to work through this because there are multiple negatives, but it is not ambiguous.

Why, then, do some people have trouble? Partly because they look at a different statement in the introductory sections. Section 4.17 Integrity constraints says:

A unique constraint is satisfied if and only if no two rows in a table have the same non-null values in the unique columns.

And I have to admit that it’s not at all obvious what that’s supposed to mean. But, since you have now seen what is in the later and more detailed sections, you should read it as “no two rows have unique-column values which are the same (an undefined word which probably means equal) and which are also non-null”.

What other DBMSs do

Trudy Pelzer and I wrote this on page 260 of our book, SQL Performance Tuning:

DBMS Maximum number of NULLs when there is a UNIQUE constraint
IBM (DB2) One
Informix One
Ingres Zero
InterBase Zero
Microsoft (SQL Server) One
MySQL Many [although the BDB storage engine was an exception]
Oracle Many
Sybase One

DB2 has an optional clause which can force standard behaviour.
Oracle is only allowing multiple NULLs because it doesn’t put NULLs in indexes; if we were talking about a multiple-column constraint, Oracle would be non-standard.

The book didn’t cover PostgreSQL and Access and Firebird, which all allow “Many” NULLs. A newer entrant, NuoDB, allows only one NULL (I told them that’s non-standard but they didn’t believe me).

Most of the complainers were apparently thinking of Microsoft SQL Server. So it’s interesting that on Microsoft’s discussion boards the complaints are (spoiler alert!) that SQL Server doesn’t follow the standard and should be more like other DBMSs. See, for example, NULL in unique index (where SQL Server gets it wrong) and Change UNIQUE constraint to allow multiple NULL values.

According to SQL Server Magazine, in 2008,

ANSI SQL supports two kinds of UNIQUE constraints — one that enforces uniqueness of NULLs just like with known values, and another that enforces uniqueness of known values but allows multiple NULLs. Microsoft SQL Server implemented only the former.

As we’ve seen, that’s false, and so is another claim, that “Informix and Microsoft SQL Server follow the other interpretation of the standard.” But perhaps Microsoft’s stories help explain people’s beliefs.

What is more sensible

An unfortunate defence of the allow-many-NULLs behaviour is that “Null values are not considered equal”. That’s what’s happening in effect in this particular case, but it’s not a general or useful rule.

  • (a) if we were doing a NULL/NULL comparison, or a NULL/not-NULL comparison, the result would be UNKNOWN. For other constraints an UNKNOWN result is considered a violation.
  • (b) “equality” is the wrong concept to consider here, it is more important to consider whether the values are “not distinct” — and in fact two NULL values are not distinct.
  • (c) NULLs sort together for purposes of GROUP BY, ORDER BY, or duplicate elimination.

Therefore, the UNIQUE rules do not follow automatically from how NULLs behave elsewhere in SQL. That is why the standard had to add wording to specify that the uniqueness predicate applies only for “non-null” values.

So, regarding what looks more sensible, it seems that the complainers have a point.

What is more convenient?

It is better, now, to carry on as before. The current behaviour is the de facto and the de jure standard. It is neither a bug nor undesirable.

But would it be even better if MySQL or MariaDB allowed an option? The DB2 syntax to allow UNIQUE with many NULLs is:
CREATE TABLE t (s1 INT UNIQUE WHERE NOT NULL)
so presumably the way to specify UNIQUE with one NULL would be:
CREATE TABLE t (s1 INT UNIQUE WHERE NULL OR NOT NULL)
but it would be easier to understand if we said
CREATE TABLE t (s1 INT UNIQUE AND MAXIMUM NUMBER OF NULLS = 1).

Anyway, it would be possible. But a nice first step would be to acknowledge that it is a feature request rather than a bug.

Copyright (c) 2013 by Ocelot Computer Services Inc. All rights reserved.

, September 11, 2013. Category: MySQL / MariaDB.

About pgulutzan

Co-author of four computer books. Software Architect at MySQL/Sun/Oracle from 2003-2011, and at HP for a little while after that. Currently with Ocelot Computer Services Inc. in Edmonton Canada.

9 Comments

  1. Hi Peter, I’m just stopping by to say hi!

    It has been a while since I read your articles – it’s great to do so again 🙂 You’re skills in dissecting topics like this, and your thoroughness in reasoning about them is simply delightful to witness. Kudos!

    warm regards,

    Roland Bouman

  2. Well, all through the years with MySQL, this is exactly how I looked at it, but many disagreed or didn’t understand the issue. My reasoning was more simple than yours though (but I have looked it up in SQL-99): Since NULL isn’t a value, rather it’s the absence of a value, then all comparisons fail.

    Or: Here is a parking spot, and only one car can park there at the time. So if car A is there, car B cannot be there at the same time. But when there is an absence of a car in the parking spot, then both A and car B isn’t there. So if the value of the spot is NULL (no car parked), then both statements “Car A isn’t here” and “Car B isn’t here”, actually any car, isn’t there. Hmm this started out as a god example, but rereading it I don’t know anymore. Also, this is not true if this is Larrys parking spot. If your car A is in that spot, then it will probably be shot at by fighter jet Mig-21 B.

    • pgulutzan says:

      Actually comparisons don’t “fail”, they succeed, the result is unknown. So, as I try to suggest in the post, it’s not NULL=NULL comparisons that establish correct behaviour, it’s what the standard says about this specific situation.

  3. Jan Steinman says:

    Keep in mind that NULL is not a value! It is a flag that says, “This item is not a value.”

    That’s the simple test I keep in mind to help me through such things. Too many people have been polluted by C-like languages, that often define NULL as the same as zero. I prefer the Smalltalk/SIMULA heritage, where the nil object is equivalent to itself, but not equal to anything else.

  4. Lars Johansson says:

    Hi,
    Thanks for an interesting post.
    I always regarded NULL as no information/not defined and to me it seems SQL is of the same opinion since NULL = NULL is false, (something I many times have had problems to explain to others).
    My opinion is; it is more convenient to allow many NULLs in unique relations than not, i.e. do not put NULLS in unique indexes

    • pgulutzan says:

      Mr Johansson and Mr Steinman: Thanks for the comments, but the statement “NULL = NULL is false” is wrong, and it has no effect on this issue whether NULL is called a value so let’s let that be a topic for another day. [ Update: I originally wrote ‘false’ rather than ‘wrong’ but that was too glib; the point though is addressed in the original post. ]

  5. Sergei Golubchik says:

    Peter, typos:

    You say
    “But the unique predicate is only True if the requirement is met for two rows in t. Therefore the result of the unique predicate is False.”
    Should be
    “But the unique predicate is only False if the requirement is met for two rows in t. Therefore the result of the unique predicate is True.”

    You say
    “In other words, UNIQUE (s1) is False. Therefore NOT (UNIQUE (s1)) is True.”
    Should be
    “In other words, UNIQUE (s1) is True. Therefore NOT (UNIQUE (s1)) is False.”

    The rest is correct, “SELECT * FROM TNN WHERE NOT (UNIQUE (t)) returns zero rows.” precisely because NOT(UNIQUE(t)) is False.

    • pgulutzan says:

      Thanks for the observation, and thanks for kindly saying I made a mere typo when I wrote False instead of True, and vice versa. I have edited the post.