Category: MySQL / MariaDB


TokuDB Features

Many people have tested whether the TokuDB storage engine runs faster and compresses better than the default storage engine, InnoDB. I am more concerned about TokuDB’s features. Or, to put it unfairly: can it do everything that InnoDB can do, in a current version?

Vadim Tkachenko of Percona supplied a binary download for a platform that I happen to have (Ubuntu 12.04), so I got MySQL 5.6 + TokuDB 7.1 up and running within 15 minutes. Then I found that I could not break anything within 60 minutes. I I conclude that “ease of use” and “stability” are okay. (For anyone who thinks I’m rushing: yes, but usually finding a bug in a new version takes less than an hour so I’m judging from experience.)

Almost everything = No problem

I tried compound indexes, long strings, the relatively-new TIMESTAMP(6) data type with fractional seconds, utf8mb4 varchar columns with German collation, transaction rollbacks with or without savepoints, crash recovery, and partitions.

I always got the same results as I get with InnoDB. It seems that the Tokutek folks looked at compatibility before declaring that their product is a “drop-in replacement storage engine for MySQL and MariaDB”.

Foreign keys

These statements didn’t cause an error but the FOREIGN KEY clause was ignored:

CREATE TABLE tp (s1 INT, PRIMARY KEY (s1)) engine=tokudb;
CREATE TABLE tf (s1 INT, FOREIGN KEY (s1) REFERENCES tp (s1)) engine=tokudb;

If I use InnoDB, the statements succeed. In basic English, TokuDB doesn’t support referential integrity.

This is a big deal because foreign keys are part of “core” standard SQL. In mitigation, I have to allow that many users eschew foreign keys in this benighted sharded world, and that in theory one could replace some foreign-key functionality with triggers. But I doubt that anybody ever does.

Full text

These statements resulted in an error:

CREATE TABLE tq (s1 TEXT) engine=tokudb;
CREATE FULLTEXT INDEX iq ON tq (s1);

If I use InnoDB, the statements succeed, because InnoDB (since version 5.6.4) supports full-text indexing, TokuDB doesn’t.

I’m sure I could hear objections like “that’s not what TokuDB is for” (true), “InnoDB’s implementation is bad” (actually that’s my own opinion but I’m sure somebody somewhere shares it), or “use Sphinx” (I don’t understand the logic here but I heard it when I complained about a full-text deficiency in another product).

To refute all objections, I went to bugs.mysql.com, clicked “Return only bugs with status = ALL”, clicked “Show = ALL”, clicked “Restrict to bugs in selected categories = MySQL Server: FULLTEXT search”, and clicked “Search”. I got 107 hits So full text must be important to somebody, else they wouldn’t report bugs.

Isolation

I started two connections, both with autocommit = 0 and default transaction isolation = REPEATABLE READ. On the first connection I said

CREATE TABLE t2 (s1 INT, UNIQUE(s1)) engine=tokudb;
INSERT INTO t2 VALUES (0);

On the second connection I said:

INSERT INTO t2 VALUES (1);

Result: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction.
If I use engine=innodb, I do not see this error, both INSERTs work.

I also see that there’s a documented behaviour difference with SELECT FOR UPDATE.

I conclude that TokuDB’s different locking methods mean that some SQL statements will have different effects in multi-user environments, if one switches from InnoDB.

Names

A storage engine maker cannot muck with MySQL’s syntax, with two exceptions: (1) this rule doesn’t apply to InnoDB; (2) new variables or information_schema tables can be added that are specific to tuning or monitoring.

So I said “SHOW VARIABLES LIKE ‘tokudb%'” … and I did see some things that bothered me. For example:
() tokudb_lock_timeout seems to be something like innodb_lock_wait_timeout so it’s unfortunate that they didn’t call it tokudb_lock_wait_timeout
() tokudb_commit_sync seems to do some things like innodb_flush_log_at_trx_commit so it’s unfortunate that they didn’t use the same numbers
() tokudb_fs_reserve_percent? well, I couldn’t tell if it’s unfortunate or not because I couldn’t find it in the TokuDB user manual.

Platforms

Tokutek’s FAQ says that TokuDB was tested on Centos 5 and CentOS 6. Doubtless it also works on all other Linux 64-bit distros (I downloaded for Ubuntu 12.04 and the MariaDB manual mentions Fedora). I see that Antony Curtis got it going on his Macbook.

That leaves three major items from the list of platforms that MySQL supports: FreeBSD, Solaris, and Windows. The trouble is that one of them, Windows, is what beginners or students might start with. As a result, MySQL + InnoDB or MariaDB + InnoDB will remain the default choice for introductory texts and some personal uses.

I can’t see how this will change, since there’s no money in the entry-level market and there’s no point in a “Windows server” for Tokutek’s real objective, which is to capture a piece of the must-have-performance market. But if they don’t care, the mind share won’t be there.

You’re forearmed

I fear that, by concentrating on TokuDB’s performance advantages, people have failed to worry about losing features that are available with InnoDB. Even if you conclude that you want to use TokuDB — especially if you conclude that you want to use TokuDB — you should be aware of them.

The UTF-8 world is not enough

In English there are very few words of Japanese origin but I think this one has a great future: mojibake. Mojibake is the garbage you see when MySQL or MariaDB has a column definition saying character set A, stores into it a string that’s actually in character set B, then ships it to a client which expects everything to be in character set C.

For some DBMSs (Drizzle and NuoDB spring to mind) it’s apparent that the developers decided that users could avoid mojibake, and developers could avoid a lot of extra trouble, if everybody had the same character set: UTF-8. Well, MySQL and MariaDB have web users and UTF-8 is popular on the web. According to Web Technology Surveys, 80% of web sites use UTF-8. There’s a Google study that came to a vaguely similar conclusion.

And yet, and yet, the arguments are devastating for why should MySQL or MariaDB continue to support character sets other than UTF-8, and even add one or two more.

Space

With a Greek character set like 8859-7, it takes one (1) byte to store one ordinary Greek character. With UTF-8 it takes two (2) bytes to store one ordinary Greek character. So 8859-7 is 50% cheaper for Greek words, and the same ratio holds for other 8-bit non-Latin language-specific character sets like Bulgarian.

With a multi-byte character set like euckr (for Korean) things get more complicated and results will vary. I took the lead paragraph for the Korean Wikipedia article on Seoul, dumped the contents into columns with different character set definitions, and compared the sizes with the octet_length() function:

utf8     1260              /* worst */
utf16    1138              /* middle */
euckr     916              /* best */

Unsurprisingly, it takes fewer bytes to store Korean text with a “made-in-Korea” character set. Also it’s a bit cheaper to use utf16 (which always requires two bytes per character for ordinary Korean words), which is doubtless one reason that the SQL Server folks prefer UTF-16 over UTF-8. I of course will ignore any objections that space doesn’t matter, or that compression algorithms exist.

Conversion trouble

It is true that Unicode’s repertoire is a superset of what’s in Japanese character sets like sjis or ujis. But that does not mean that you can start with sjis text, convert to Unicode, then convert back to sjis. On some rare occasions such a “round trip” will fail. This happens because the mapping is not always one-to-one. But it also happens because there are slight differences between various versions of sjis — that’s a fact which causes no difficulty if there’s no conversion, but causes some difficulty if something comes in from the client as sjis, gets converted to Unicode for storage, then gets converted back to sjis when the server retrieves something for the client.

Legacy

At the start of this post I mentioned that 80% of web sites use UTF-8. However, that is the world-wide figure. For a language-specific look, here are two sobering statistics:

* Only 54% of Chinese-language web sites predominantly use UTF-8
* Only 60% of Japanese-language web sites predominantly use UTF-8
Source: Web Technology Surveys

Hunting for examples, I went to the alexa.com page for “Top Sites in China”, then on my Firefox browser I clicked Tools > Page Info: Encoding. Five of the top ten said their encoding was UTF-8, but the other five said their encoding was GBK. This is not a reliable test because the site owner could be fibbing. But even if only a few are telling the truth, it’s clear there are many holdouts in the Non-UTF-8 crannies.

Of course for all of these arguments there are counter-arguments or even possible refutations. But they wouldn’t matter to the DBMS vendor. The vendor is supposed to help the customer with the customer’s plan, rather than explain that they should do something which, with suspicious convenience, would make the vendor’s life easier.

I had a look at the open-source SQL DBMSs, and as far as I could tell only five support both storage and input with a variety of encodings: MySQL and MariaDB of course, but also PostgreSQL, Ingres, and Firebird.
VoltDB definitely cannot. The Java-based ones apparently cannot, but I didn’t make certain.

Regular (RLIKE and REGEXP) expressions: Good News

Ten years ago, MySQL got a “bug report” about trouble with RLIKE. It got marked “won’t fix” because MySQL used a regular-expression library that couldn’t handle non-ASCII characters reliably. Over time there were multiple similar or duplicate bug reports but the only result was a note in the MySQL manual saying, in effect, “tough luck”. Until now.

Actually the fix is in a pre-release of the bzr branch for MariaDB 10.0.5, and I can’t predict that the MySQL folks will copy it, but it looks good to me, and I clap for the original coders who made the “Perl Compatible Regular Expressions” library, the “Google Summer of Code” folks who pay students to help with open source projects, the student coder in question — Sudheera Palihakkara from a Sri Lanka university — and finally Alexander Barkov from the MariaDB foundation.

So what were the problems and what’s been solved — or not solved?

Solved: Handling any characters

Most of the complaints about REGEXP have been about handling of accented Latin characters, Cyrillic characters, or, in short: regardless of character set or collation, if any character in the pattern was outside the ASCII range or was NUL, then there would be false hits, no hits, miscalculated lengths, or failed case conversions. I took examples from the “how to repeat” section of those bug reports, and tried them with MariaDB 10.0.5. Results:

/* Bug#34473 */
MariaDB [test]> SELECT 'Ørneborgen' RLIKE '^[Ö]';
+-----------------------------+
| 'Ørneborgen' RLIKE '^[Ö]'   |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (0.03 sec)

/* Bug#54576 */
MariaDB [test]> SELECT 'č' REGEXP '^.$';
+-------------------+
| 'č' REGEXP '^.$'  |
+-------------------+
|                 1 |
+-------------------+
1 row in set (0.00 sec)

/* Bug#52080 */
MariaDB [test]> SELECT 'Я' regexp 'я';
+------------------+
| 'Я' regexp 'я'   |
+------------------+
|                1 |
+------------------+
1 row in set (0.02 sec)

/* Bug#63439 */
MariaDB [test]> SELECT 'ääliö' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$';
+--------------------------------------------------+
| 'ääliö' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$'       |
+--------------------------------------------------+
|                                                1 |
+--------------------------------------------------+
1 row in set (0.03 sec)

/* Bug#70470 */
MariaDB [test]> SELECT 'abc\0def' REGEXP 'def';
+-------------------------+
| 'abc\0def' REGEXP 'def' |
+-------------------------+
|                       1 |
+-------------------------+
1 row in set (0.00 sec)

All the above answers are correct, for the first time ever.

In fact one could say that the new REGEXP handles too many characters. It’s adjusted for Unicode 6.2, a version that neither MySQL nor MariaDB support yet. This will of course be of concern to those who use the Meroitic alphabet, but they’re a fading minority.

Not solved: incompatibilities

I guess it was inevitable that, since there’s a different underlying library, the text of some error messages would change. For example I get this:

MariaDB [test]> select '1' rlike rpad('1',100,'(');
ERROR 1139 (42000): Got error 'missing ) at offset 100' from regexp

and that’s okay, but in MySQL 5.6.14 the error message is:
“ERROR 1139 (42000): Got error ‘parentheses not balanced’ from regexp”.

A bit more interesting is this query:

/* Bug#68153 */
MariaDB [test]> SELECT CHAR(126) REGEXP '[[.tilde.]]';
ERROR 1139 (42000): Got error 'POSIX collating elements
 are not supported at offset 1' from regexp

and that looks a little less okay. In MySQL 5.6.14 the same search condition doesn’t cause an error, it’s considered to be “true”.

Solved: Handling more Posix

The MariaDB folks have proudly announced that version 10.0.5 supports “recursive patterns, named capture, look-ahead and look-behind assertions, non-capturing groups, non-greedy quantifiers”. To which I’m sure that the response will be: What? How shocking that non-greedy quantifiers weren’t added ten years ago!

Well, maybe not. But when looking at new features, it’s wrong to ask merely whether there’s some advantage for the non-greedy man on the Clapham omnibus. The point is it’s supposed to be more compatible with modern Perl programs, with Posix, and with all the box-tickers who will ask: do you really support regular expressions? And if someday I reach that pinnacle of ambition, to figure out grep, then I’ll doubtless appreciate look-behind assertions in the way they deserve to be appreciated.

Unsolved: handling more standard SQL

In the operetta Die Fledermaus there’s a refrain: “Happy is the one who forgets what isn’t going to be changed”. But I, alas, can’t forget the comment that I wrote on Bug#746 ten years ago, showing there are at least six behavioural differences between REGEXP and the optional-SQL-standard operator, SIMILAR TO. The incompatibilities arise because SIMILAR TO works like LIKE, so it’s more SQL-ish, and less Posix-ish. Thus, improvements in REGEXP alone are not an advance towards more standard SQL. This would matter more if there were many other SQL implementations supporting SIMILAR TO. The only ones I know of are PostgreSQL and Firebird.

Triggers: Comparisons, New Features, and a Trick

I’ll show a chart which indicates the level of support for trigger features in major open-source DBMSs.
I’ll comment on new features in MySQL 5.7.
I’ll show how triggers can be used to abort statements which are taking too long.

Trigger features in major open-source DBMSs

Feature Firebird Ingres MySQL+MariaDB PostgreSQL
Any compound statement YES YES
Alter YES
Disable YES YES
For each statement YES YES
For each row YES YES YES YES
One trigger, multiple trigger events YES YES YES
Multiple triggers for same situation YES YES
Views’ “instead of” clause YES
Deferrable YES
Normally standardish syntax YES YES YES
“New” and “Old” variables YES YES YES YES
“When” clause or equivalent YES YES


Explanation of each column:
“Any compound statement”:
“YES” if trigger body can be something like BEGIN statement1;statement2; … END.
“-” if trigger body an only be something like a single CALL statement.
“Alter”:
“YES” if ALTER TRIGGER is legal and can do something significant (more than renaming). Non-standard.
“-” if the normal way to change a trigger is to drop and create again.
“Disable”:
“YES” if there is a persistent way to make a trigger inactive or disabled. Non-standard.
“-” if the normal way to disable a trigger is to drop it or add an IF clause.
“For each statement”:
“YES” if checking happens before/after rest of statement processing
“-” if checking is only for each row
“For each row”:
“YES” if checking happens before/after rest of row processing
“-” if checking is only for each statement
“One trigger, multiple trigger events”:
“YES” if clauses like “BEFORE INSERT OR UPDATE OR DELETE” are possible. Non-standard.
“-” if clause can only contain one verb (no OR allowed)
“Multiple triggers for same situation”:
“YES” if there can be (for example) two BEFORE UPDATE triggers on table t.
“-” if only one trigger is possible for the same action at the same time.
Views’ “instead of” clause:
“YES” if INSTEAD OF triggers are possible (usually for views only)
“-” if the only trigger action times are BEFORE and AFTER
“Deferrable”:
“YES” if it’s possible to defer “after” checking until end of transaction. Non-standard.
“-” if all checking is during statement execution
“Normally standardish syntax”:
“YES” if normal CREATE TRIGGER syntax is reminiscent of standard syntax
“-” if normal/recommended syntax is CREATE RULE, or normal body is EXECUTE clause.
“New” and “Old” variables:
“YES” if a trigger can use NEW.column_name and/or OLD.column_name
“-” if column values are invisible during trigger execution
“When” clause or equivalent:
“YES” if syntax is CREATE TRIGGER … [WHEN condition] …;
“-” if condition has to be in an IF statement in the trigger body

Information for this chart comes from the documentation for Firebird 1.5, Ingres 10.0, MySQL 5.6, and PostgreSQL 9.3.

Two New Features In MySQL 5.7

The first enhancement is that it is possible to say

  CREATE TABLE t (s1 INT NOT NULL)//
  CREATE TRIGGER t1 BEFORE INSERT ON t FOR EACH ROW
  BEGIN
    IF new.s1 IS NULL THEN SET new.s1=0; END IF;
    END//
  INSERT INTO t VALUES (NULL)//

That is, NOT NULL no longer stops the INSERT if the trigger was going to change the value to non-null anyway. This was a long-standing feature request, and I doubt that anyone will complain.

The second enhancement is WL#3253: Triggers: multiple triggers per table.
In 5.6, if there are two actions to trigger, one must put them both in one trigger:

  CREATE TRIGGER t1 BEFORE INSERT ON t FOR EACH ROW
  BEGIN
    statement1
    statement2
    END

In 5.7 one can put them in two triggers:

  CREATE TRIGGER t1 BEFORE INSERT ON t FOR EACH ROW
  BEGIN
    statement1
    END
  CREATE TRIGGER t2 BEFORE INSERT ON t FOR EACH ROW
  BEGIN
    statement2
    END

This is unnecessary for a DBMS like MySQL that supports compound statements for triggers, but it’s in the standard, and there was at least one feature request for it. The problem with multiple triggers is that there has to be some way to decide which one is activated first, and the standard isn’t helpful here — it says to follow the order of creation, but that’s rigid, and also
unreliable if one depends on CURRENT_TIMESTAMP. PostgreSQL “solved” this by saying the order is the alphabetical order of the trigger names, but then Oracle 11g came along with a syntax that MySQL 5.7 copies:

  CREATE TRIGGER t2 BEFORE INSERT ON t FOR EACH ROW FOLLOWS t1 ...
  or
  CREATE TRIGGER t2 BEFORE INSERT ON t FOR EACH ROW PRECEDES t1 ...

“FOLLOWS t1” means “is activated immediately after t1”, and in INFORMATION_SCHEMA.TRIGGERS there is an integer column ACTION_ORDER which reflects this (this column is not new, but in earlier MySQL versions it is always zero).
The feature works, and the worklog task for it has good high-level architecture description. But, once you’ve made trigger t2 follow t1, you’re stuck — there’s still no ALTER TRIGGER. And I think that, even if you know what ACTION_ORDER is, you’ll still get confused. To test that, here is a quiz. After

  CREATE TABLE t (s1 CHAR(3));
  CREATE TRIGGER t1 BEFORE UPDATE ON t FOR EACH ROW
    SET new.s1 = CONCAT(old.s1,'c');
  CREATE TRIGGER t2 BEFORE UPDATE ON t FOR EACH ROW PRECEDES t1
    SET new.s1 = CONCAT(new.s1,'b');
  INSERT INTO t VALUES ('a');
  UPDATE t SET s1 = '';

Does s1 contain ‘ab’ or ‘abc’ or ‘acb’ or ‘ac’? Answer at end of post. If you get it wrong, you should continue with the old way and put all actions in a single trigger; however, the coder (apparently Mr Dmitry Shulga) deserves congratulation.

Stopping an update after 10 seconds

Of course the legitimate uses of triggers are (a) to make up for MySQL’s lack of CHECK clauses by producing an error when a NEW value is wrong; (b) to add to a summary in table b whenever there’s a change to table a.

A less tame use — because it is non-deterministic — is aborting a statement if some external condition occurs. The example here is saying, in effect, “If 10 seconds have elapsed since the update began, stop and return an error”. Since the ordinary timers like CURRENT_TIMESTAMP don’t change during a statement, I use SYSDATE.
[ UPDATE: In an earlier version of this article, I used a low-level
counter from PERFORMANCE_SCHEMA. Thanks to Justin Swanhart for suggesting a simpler way. It won’t work if the server was started with –sysdate-is-now, but my original method could also have failed in some circumstances. ]


/* Make a test table with 25 rows. */
CREATE TABLE t (s1 INT);
INSERT INTO t VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),
                     (11),(12),(13),(14),(15),(16),(17),(18),(19),(20),
                     (21),(22),(23),(24),(25);

/* Make a trigger so (current time - start time) is checked after
   each row update, and SIGNAL if the result is more than 10. */
delimiter //
CREATE TRIGGER abort_after_elapsed_time BEFORE UPDATE ON t FOR EACH ROW
IF TO_SECONDS(SYSDATE()) - @start_time > 10 THEN
  SIGNAL SQLSTATE '57005' SET message_text='Timeout';
  END IF;//
delimiter ;

/* Set a variable which will contain the current time. */
/* This should be done just before doing the UPDATE. */
SET @start_time = TO_SECONDS(SYSDATE());

/* Do a very slow UPDATE which would take
   (25 rows * 1 second per row) if no timeout. */
select current_timestamp;
UPDATE t SET s1 = SLEEP(1);
select current_timestamp;

Example result showing “Timeout” error occurs after about 10 seconds:


MariaDB [test]> select current_timestamp;
+---------------------+
| current_timestamp   |
+---------------------+
| 2013-10-07 12:54:39 |
+---------------------+
1 row in set (0.00 sec)

MariaDB [test]> UPDATE t SET s1 = SLEEP(1);
ERROR 1644 (57005): Timeout
MariaDB [test]> select current_timestamp;
+---------------------+
| current_timestamp   |
+---------------------+
| 2013-10-07 12:54:50 |
+---------------------+
1 row in set (0.00 sec)

Your mileage may vary but by luck it worked this way on my machine. Probably a progress bar would be more interesting, eh? I’ll get back to that theme if I ever finish one.

Answer to quiz question

‘ac’.

Sometimes MySQL is more standards-compliant than PostgreSQL

Here are examples comparing MySQL 5.6 against PostgreSQL 9.3 Core Distribution, where MySQL seems to comply with “standard SQL” more closely than PostgreSQL does. The examples are also true for MariaDB 10.0 so whenever I say “MySQL” I mean “MySQL and/or MariaDB”. When I say “more closely” I do not mean that MySQL is completely compliant, or that PostgreSQL is completely non-compliant.

Identifiers

Example:

CREATE TABLE          ŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽ (s1 INT); /* 32-character name */
SELECT COUNT(*) FROM information_schema.tables
  WHERE table_name = 'ŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽŽ';
SELECT COUNT(*) FROM information_schema.tables
  WHERE table_name LIKE LOWER('Ž%');

Result:
PostgreSQL says count(*) is 0. MySQL says it’s 1.
Reason:
(1) PostgreSQL maximum identifier length is 63 bytes; MySQL maximum identifier length is 64 characters. The standard requirement is 128 characters.
(2) With PostgreSQL, if you insert an invalid value, PostgreSQL truncates — it “tries to make do” rather than failing. (If the name had been too long for MySQL, it would have thrown an error.)
(3) PostgreSQL does not convert to lower case during CREATE, and does a case-sensitive search during SELECT.

Character Sets And Collations

Example:

CREATE TABLE t (s1 CHAR(1) CHARACTER SET utf16);

Result:
PostgreSQL returns an error. MySQL succeeds.
Reason:
PostgreSQL does not support named character sets; it can only handle one character set per database.

Example:

CREATE TABLE t (s1 CHAR(2), s2 VARCHAR(2));
INSERT INTO t VALUES ('y ','y ');
SELECT * FROM t WHERE s1 = 'y';
SELECT * FROM t WHERE s2 = 'y';

Result:
PostgreSQL finds one row for the first SELECT but zero rows for the second SELECT. MySQL finds one row both times.
Reason:
PostgreSQL does not always add spaces to the shorter comparand, or remove spaces from the longer comparand. MySQL is consistent. The behaviour is optional, but it is not supposed to depend on the data type.

Example:

CREATE TABLE t (s1 CHAR(1), uca CHAR(4),utf32 CHAR(6),utf8 CHAR(8),name VARCHAR(50));
INSERT INTO t VALUES (U&'\+003044','3197','  3044','  E38184','HIRAGANA LETTER I');
INSERT INTO t VALUES (U&'\+003046','3198','  3046','  E38186','HIRAGANA LETTER U');
INSERT INTO t VALUES (U&'\+0030A4','3197','  30A4','  E382A4','KATAKANA LETTER I');
INSERT INTO t VALUES (U&'\+0030A5','3198','  30a5','  E382A5','KATAKANA LETTER SMALL U');
INSERT INTO t VALUES (U&'\+00FF72','3197','  FF72','  EFBDB2','HALFWIDTH KATAKANA LETTER I');
INSERT INTO t VALUES (U&'\+00042B','1AF1','  042B','    D0AB','CYRILLIC CAPITAL LETTER YERU');
INSERT INTO t VALUES (U&'\+0004F9','1AF5','  04F9','    D3B9','CYRILLIC SMALL LETTER YERU WITH DIAERESIS');
INSERT INTO t VALUES (U&'\+0004CF','1B4A','  04CF','    D38F','CYRILLIC SMALL LETTER PALOCHKA');
INSERT INTO t VALUES (U&'\+002C13','1B61','  2C13','  E2B093','GLAGOLITIC CAPITAL LETTER RITSI');
INSERT INTO t VALUES (U&'\+0100CC','3872',' 100CC','F090838C','LINEAR B IDEOGRAM B240 WHEELED CHARIOT');
SELECT * FROM t ORDER BY s1 COLLATE "C",uca DESC,name;
SELECT * FROM t ORDER BY s1 COLLATE "POSIX",uca DESC,name;
SELECT * FROM t ORDER BY s1 COLLATE "C.UTF-8",uca DESC,name;
SELECT * FROM t ORDER BY s1 COLLATE "en_CA.utf8",uca DESC,name;
SELECT * FROM t ORDER BY s1 COLLATE "default",uca DESC,name;
SELECT * FROM t ORDER BY s1 COLLATE "ucs_basic",uca DESC,name;
SELECT * FROM t ORDER BY s1 COLLATE "zh_CN",uca DESC,name;

Result:
With PostgreSQL, no matter which collation one chooses, one does not get a linguistic standard ordering. Here is a typical result:
psql_result
With MySQL, if one enters the same data (albeit in a different way), and chooses collation utf8mb4_unicode_520_ci, one gets a standard result.
Reason:
PostgreSQL depends on the operating system for its collations. In this case my Linux operating system offered me only 5 collations which were really distinct. I did not attempt to customize or add more. I tried all the ones that were supplied, and failed to get a result which would match the Unicode Collation Algorithm order (indicated by the ‘uca’ column in the example). This matters because the standard does ask for a UNICODE collation “in which the ordering is determined by applying the Unicode Collation Algorithm with the Default Unicode Collation Element Table [DUCET]”. MySQL is a cross-platform DBMS and does not depend on the operating system for its collations. So, out of the box and for all platform versions, it has about 25 distinct collations for 4-byte UTF8. One of them is based on the DUCET for Unicode 5.2.

MySQL’s character set and collation support is excellent in some other respects, but I’ll put off the paeans for another post. Here I’ve just addressed a standard matter.

Views

Example:

CREATE TABLE t (s1 INT);
CREATE VIEW v AS SELECT * FROM t WHERE s1 < 5 WITH CHECK OPTION;

Result:
PostgreSQL returns an error. MySQL succeeds.
Reason:
MySQL has some support for core standard feature F311-04 “Create view: with check option”. PostgreSQL does not.

Transactions

Example:

START TRANSACTION;
CREATE TABLE t (s1 SMALLINT);
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (32768); /* This causes an error. */
COMMIT;
SELECT * FROM t;

Result:
MySQL finds a row containing 1. PostgreSQL finds nothing.
Reason:
PostgreSQL rolls back the entire transaction when it encounters a syntax error. MySQL only cancels the statement.
Now, PostgreSQL is within its rights — the standard says that an implementor may do an “implicit rollback” for an error. But that is a provision for what a DBMS implementor MAY do. From other passages in the standard, it’s apparent that the makers didn’t anticipate that a DBMS would ALWAYS do it, even for syntax errors. (For example it says: “exception conditions for transaction rollback have precedence over exception conditions for statement failure”.) Even Microsoft SQL Server, in its optional “abort-transaction-on-error” mode, doesn’t abort for syntax errors. So MySQL appears to be nearer the spirit of the standard as generally understood.

Example:

CREATE TABLE t (s1 INT);
INSERT INTO t VALUES (1);
COMMIT;
START TRANSACTION;
SELECT CURRENT_TIMESTAMP FROM t;
/* Pause 1 second. */
SELECT CURRENT_TIMESTAMP FROM t;

Result: PostgreSQL shows the same timestamp twice. MySQL shows two different timestamps.
Reason:
PostgreSQL keeps the same time throughout a transaction; MySQL keeps the same time throughout a statement.
The key sentences in the standard say that the result of a datetime value function should be the time when the function is evaluated, and “The time of evaluation of a datetime value function during the execution of S and its activated triggers is implementation-dependent.” In other words, it’s supposed to occur during the execution of S, which stands for Statement. Of course, this leads to arguable matters, for example what if the statement is in a function that’s invoked from another statement, or what if the statement is within a compound statement (BEGIN/END block)? But we don’t have to answer those questions here. We just have to observe that, for the example given here, the DBMS should show two different timestamps. For documentation of how DB2 follows this, see ibm.com.

Stored Procedures, Functions, Triggers, Prepared Statements

Example:

CREATE FUNCTION f () RETURNS INT RETURN 1;

Result:
PostgreSQL returns an error. MySQL succeeds.
Reason:
PostgreSQL does not support any functions, stored procedures, or triggers with standard syntax.
Instead PostgreSQL supports Oracle syntax. This is not as bad it sounds — the Oracle syntax is so popular that even DB2 also has decided to support it, optionally.

Example:

PREPARE stmt1 FROM 'SELECT 5 FROM t';

Result:
PostgreSQL returns an error. MySQL succeeds.
Reason:
PostgreSQL doesn’t support the standard syntax for PREPARE.

Data Types

Example:

CREATE TABLE t (s1 CHAR(1));
INSERT INTO t VALUES (U&'\+000000');

Result:
PostgreSQL returns an error. MySQL succeeds, although MySQL has to use a different non-standard syntax.
Reason:
PostgreSQL has an aversion to CHR(0) the NUL character.(“The character with the code zero cannot be in a string constant.”). Other DBMSs allow all characters in the chosen character set. In this case, the default character set is in use, so all Unicode characters should be okay.

Example:

CREATE TABLE t (s1 BINARY(1), s2 VARBINARY(2), s3 BLOB);

Result: PostgreSQL returns an error. MySQL succeeds.
Reason:
PostgreSQL does not support BINARY or VARBINARY or BLOB. It has equivalent non-standard data types.

The Devil Can Cite Scripture For His Purpose

I could easily find examples going the other way, if I wrote a blog post titled “Sometimes PostgreSQL is more standards-compliant than MySQL”. But can one generalize from such examples?
The PostgreSQL folks boldly go to conclusion mode:

As a quick summary, MySQL is the “easy-to-use, web developer” database, and PostgreSQL is the “feature-rich, standards-compliant” database.

postgresql.org FAQ

Follow I do not dare.

Representing Sex In Databases

The MySQL Reference Manual advises:


Use a CREATE TABLE statement to specify the layout of your table:
mysql> CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20),
-> species VARCHAR(20), sex CHAR(1), birth DATE, death DATE);

Several types of values can be chosen to represent sex in animal records, such as ‘m’ and ‘f’, or perhaps ‘male’ and ‘female’. It is simplest to use the single characters ‘m’ and ‘f’.

I will supplement that with this post about representing sex (gender) in people records. I have to consider the name, the possible values, the data type, and the constraints. It won’t be simplest.

Should the column name be ‘sex’?

First consider Naming Conventions. The word ‘sex’ is not too long, its characters are all Latin, there is no need to “delimit”, using lower case is okay, making it plural (‘sexes’) would be a mistake, and it’s not a reserved word. So far so good.

Next consider the alternative term ‘gender’. Some people make a distinction between gender and sex. In that case, ‘sex’ is better if only a physical characteristic matters, but ‘gender’ is better if you’re also taking into account the behaviour, self-identification, and washroom preference. However, gender’s primary dictionary meaning is as a grammatical category, and it is a far less popular word.

Conclusion: yes, the column name should be ‘sex’.

Should the possible values be ‘m’ and ‘f’?

Now consider interoperability. If there are application programs that use the database, what checkboxes are they likely to display? If the data must go to some other body (the government, say), do they tell you what the choices are? This is where standards come in. For pretty well anything that’s commonly used to describe a person (age, occupation, income, and of course sex) there will be a standard code list.

The German Wikipedia article about “Data standards for description of sex” describes many of them. It’s worth a look even if your German is weak, because most of the charts include English translations. I’ll try to use primary sources and then generalize:

ISO 5218 is the international standard, and gets a recommendation in a chapter about sex in “Joe Celko’s Data, Measurement and Standards in SQL”. The problem is it only uses digits. This eliminates the worry that it will look biased towards West European languages, but then it looks biased towards males (because ‘1’ comes before ‘2’), and on that basis has been called “offensive”.

More seriously, a digit has no mnemonic value and that means inputters and readers would have to learn more, or developers would have to write more.
Let’s look instead for something that the other “standards” can agree on.

They all say ‘M’ = male and ‘F’ = female, so that’s indisputable.

Most of them say there’s at least one other category, and although it is disputable it is prudent to add ‘O’ = other, forestalling complaints like the lack of ‘fluidity’ objection from the Gay, Lesbian, Bisexual, and Transgender Round Table of the American Library Association.

Most of them say there are codes for “unknown” and “not applicable”, and this should be disputed. Such codes are necessary for simple lists, but for SQL databases they are unnecessary because NULLs exist. If we said that the ‘sex’ column has special values for “unknown” or “not applicable”, but ‘owner’ and ‘species’ do not, we would need inconsistent rules. Therefore “unknown” and “not applicable” are to be handled with NULL.

They all say that ‘M’ and ‘F’ are upper case. Of course it might not matter in MySQL or MariaDB, where the default collation is case insensitive. But other DBMSs are stricter, and we’re thinking about interoperability here, so lower case codes like ‘m’ and ‘f’ are not acceptable equivalents.

Conclusion: allow ‘M’, ‘F’, ‘O’, and NULL.

Should the data type be CHAR(1)?

Now consider what best fits for the range of values, without totally forgetting efficient storage.


  • ENUM(‘M’,’F’,’O’)
    This has the advantage that input errors, like a sex value of ‘!’, won’t be allowed — the closest that MySQL can come to a CHECK clause. And it’s handy that ENUMs can be searched as numerics, which for this definition would be 1 = M and 2 = F, the same as the ISO/IEC requirement. But ENUM is not standard.
  • SMALLINT or TINYINT
    Well, any numeric type would be okay for storing an ISO-5218-style digit. However, saying it’s SMALLINT is telling the world “it’s a number”. Just because a digit can appear in a numeric column doesn’t mean that any of the things that are meaningful for numbers (averaging, multiplying, etc.) should be applicable to ‘sex’.
  • BOOLEAN
    MySQL’s BOOLEAN is just a synonym for TINYINT(1) so is of no use. If we were talking about the real idea of a Boolean — a binary TRUE/FALSE — we’d better change the column name to reflect that the answers are true or false — instead of ‘sex’, the column name should be ‘is_male’.
  • VARCHAR(5)
    This is a bit more flexible than CHAR(1) if one fears that someday there will be a mandate to store as CID 7455 which has 5-character codes, but VARCHAR might need more space.
  • BIT(1) and CHAR(0)
    Speaking of needing space, a bunch of BIT columns might be more economical for storage than a bunch of CHAR(1) columns, with MyISAM. And for the CHAR(0) trick, see a 2008 post from Percona. BIT(1) and CHAR(0) would only be for people desperate to save bytes; in other respects they are the worst choice.

Conclusion: yes, the data type should be CHAR(1).

Should there be constraints to prevent bad input?

Since we won’t use ENUM, and we can’t use a CHECK clause, how can we stop people from putting in something other than ‘M’, ‘F’, ‘O’, or NULL?
With old versions we’d have to make views with CHECK OPTION clauses, or perhaps foreign-key references. Nowadays we can make triggers. For example:


DELIMITER //
CREATE TRIGGER check_sex AFTER INSERT ON pet FOR EACH ROW BEGIN
  IF new.sex COLLATE latin1_bin NOT IN ('M','F','O') AND new.sex IS NOT NULL THEN
    SIGNAL SQLSTATE '22222' SET MESSAGE_TEXT='sex must be ''M'' or ''F'' or ''O'' or NULL';
    END IF;
  END//

For historical reasons that’s not a MySQLish practice, though. Ordinarily I’d expect that people will hope that application code will keep the input clean, and will do periodic cleanups when they find that application code does not keep the input clean.

The Real Point

The MySQL manual is correct about the way to design a “My Pets” database. But if a database description will consistently map to items in the real big wide world, there are going to be multiple things to consider.

NULL and UNIQUE

When I worked for MySQL I saw frequent complaints that the UNIQUE constraint didn’t stop users from inserting NULLs, multiple times. For example:
CREATE TABLE t (s1 INT, UNIQUE (s1));
INSERT INTO t VALUES (NULL);
INSERT INTO t VALUES (NULL); /* This does not cause a “unique constraint violation” error */

There are now eleven generously-commented bugs.mysql.com reports:
#5685, #6829, #7479, #8173, #9844, #17825, #19377, #25544, #27019, #27376, #66512. The essential points (along with observations about how we were stupid or deaf or lazy) are:

  • the ANSI/ISO SQL standard said we should throw an error,
  • all the other major DBMSs would throw an error,
  • and throwing an error would be more sensible and convenient.

The first point is false; the second point depends what “major” means; the third point is a valid opinion.

I will now attempt to beat this subject to death with enough thoroughness that it will become dead.

Exactly what the ANSI/ISO standard says, and how it applies

Our example will be a table named t created thus:
CREATE TABLE t (s1 INT, UNIQUE (s1));
INSERT INTO t VALUES (NULL),(NULL);

So t looks like this:

+------------+
| s1         |
+------------+
| NULL       |
| NULL       |
+------------+

Our question will be: is the constraint “UNIQUE (s1)” violated according to the SQL standard?

I’ll use a recent draft of SQL:2011 because it’s current and it’s available for all to see, at:
jtc1sc32.org/doc/N1951-2000/32N1964T-text_for_ballot-FCD_9075-2.pdf.
The wording is about the same in all earlier versions of the standard since SQL-89, which didn’t allow NULLs at all for UNIQUE constraints.

Section 11.7 says:

[Syntax Rules]
4) … let SC be the <search condition>:
UNIQUE ( SELECT UCL FROM TNN )

where UCL means “Unique Column List” and TNN means “Table Name”.
So for our example, search condition SC is “UNIQUE (SELECT s1 FROM t)”.
[General Rules]
3) The unique constraint is not satisfied if and only if
EXISTS ( SELECT * FROM TNN WHERE NOT ( SC ) )
is True.

… Plugging in our values, that means

3) The unique constraint is not satisfied if and only if
EXISTS ( SELECT * FROM t WHERE NOT ( UNIQUE (SELECT s1 FROM t) ) )
is True.

So now we have to know how “UNIQUE(…)” is supposed to work.
Section 8.11 says:

If there are no two rows in T such that the value of each column in one row is non-null and is not distinct from the value of the corresponding column in the other row, then the result of the <unique predicate> is True; otherwise, the result of the <unique predicate> is False.

… where T is the table.

Now, apply the predicate “UNIQUE (s1)” to table t.
For either Row#1 or Row#2: Is the value of column s1 non-null?
Answer: NO. It is NULL.
Therefore the requirement “that the value of each column in one row is non-null” is met for zero rows in t.

But the <unique predicate> is only False if the requirement is met for two rows in t. Therefore the result of the <unique predicate> is True.

In other words,
UNIQUE (s1) is True.

Therefore
NOT (UNIQUE (s1)) is False.

Therefore
SELECT * FROM TNN WHERE NOT (UNIQUE (t)) returns zero rows.

Therefore
EXISTS ( SELECT * FROM TNN WHERE NOT ( UNIQUE (t) ) ) is false.

But Rule 3) says that the unique constraint is not satisfied if and only if that EXISTS clause is True.
Therefore the constraint is satisfied.
And when a constraint is satisfied, it is not violated.

Thus, according to the rules for <unique constraint definition>, our example table violates no constraint.

It’s tedious to work through this because there are multiple negatives, but it is not ambiguous.

Why, then, do some people have trouble? Partly because they look at a different statement in the introductory sections. Section 4.17 Integrity constraints says:

A unique constraint is satisfied if and only if no two rows in a table have the same non-null values in the unique columns.

And I have to admit that it’s not at all obvious what that’s supposed to mean. But, since you have now seen what is in the later and more detailed sections, you should read it as “no two rows have unique-column values which are the same (an undefined word which probably means equal) and which are also non-null”.

What other DBMSs do

Trudy Pelzer and I wrote this on page 260 of our book, SQL Performance Tuning:

DBMS Maximum number of NULLs when there is a UNIQUE constraint
IBM (DB2) One
Informix One
Ingres Zero
InterBase Zero
Microsoft (SQL Server) One
MySQL Many [although the BDB storage engine was an exception]
Oracle Many
Sybase One

DB2 has an optional clause which can force standard behaviour.
Oracle is only allowing multiple NULLs because it doesn’t put NULLs in indexes; if we were talking about a multiple-column constraint, Oracle would be non-standard.

The book didn’t cover PostgreSQL and Access and Firebird, which all allow “Many” NULLs. A newer entrant, NuoDB, allows only one NULL (I told them that’s non-standard but they didn’t believe me).

Most of the complainers were apparently thinking of Microsoft SQL Server. So it’s interesting that on Microsoft’s discussion boards the complaints are (spoiler alert!) that SQL Server doesn’t follow the standard and should be more like other DBMSs. See, for example, NULL in unique index (where SQL Server gets it wrong) and Change UNIQUE constraint to allow multiple NULL values.

According to SQL Server Magazine, in 2008,

ANSI SQL supports two kinds of UNIQUE constraints — one that enforces uniqueness of NULLs just like with known values, and another that enforces uniqueness of known values but allows multiple NULLs. Microsoft SQL Server implemented only the former.

As we’ve seen, that’s false, and so is another claim, that “Informix and Microsoft SQL Server follow the other interpretation of the standard.” But perhaps Microsoft’s stories help explain people’s beliefs.

What is more sensible

An unfortunate defence of the allow-many-NULLs behaviour is that “Null values are not considered equal”. That’s what’s happening in effect in this particular case, but it’s not a general or useful rule.

  • (a) if we were doing a NULL/NULL comparison, or a NULL/not-NULL comparison, the result would be UNKNOWN. For other constraints an UNKNOWN result is considered a violation.
  • (b) “equality” is the wrong concept to consider here, it is more important to consider whether the values are “not distinct” — and in fact two NULL values are not distinct.
  • (c) NULLs sort together for purposes of GROUP BY, ORDER BY, or duplicate elimination.

Therefore, the UNIQUE rules do not follow automatically from how NULLs behave elsewhere in SQL. That is why the standard had to add wording to specify that the uniqueness predicate applies only for “non-null” values.

So, regarding what looks more sensible, it seems that the complainers have a point.

What is more convenient?

It is better, now, to carry on as before. The current behaviour is the de facto and the de jure standard. It is neither a bug nor undesirable.

But would it be even better if MySQL or MariaDB allowed an option? The DB2 syntax to allow UNIQUE with many NULLs is:
CREATE TABLE t (s1 INT UNIQUE WHERE NOT NULL)
so presumably the way to specify UNIQUE with one NULL would be:
CREATE TABLE t (s1 INT UNIQUE WHERE NULL OR NOT NULL)
but it would be easier to understand if we said
CREATE TABLE t (s1 INT UNIQUE AND MAXIMUM NUMBER OF NULLS = 1).

Anyway, it would be possible. But a nice first step would be to acknowledge that it is a feature request rather than a bug.

Copyright (c) 2013 by Ocelot Computer Services Inc. All rights reserved.

← Previous page