SQL And Its Sequels – Page 8 – Musings On MySQL / MariaDB Features and SQL / NoSQL Open Source DBMSs

MySQL’s Date Support Is Not Absurd

MySQL has unfortunate rules for dates with two-digit years, for the date range, for post-decimal digits, for timestamp defaults … But other DBMSs can be equally bizarre

Two-digit years

Start with two-digit years. Summarizing from the manual:

If the two-digit value is <= 69, add 2000. Else add 1900. For example, if it's 65, then it's 2065.

It’s an arbitrary extra detail that users have to learn. But PostgreSQL has exactly the same rule. And three other DBMSs — DB2, Oracle, Firebird — have an even more complex rule, saying in effect that the interpretation will vary depending what year it is when you do the input. SQL Server is even worse, it allows users to change the rule slightly. Only the SQL standard has a rule that’s easy to learn and never changes: you have to enter four-digit years. Time for a modernized-English Shakespeare quote:

Hamlet: Why was he sent into England?
First Clown: Why, because he was mad. He shall recover his wits there, or, if he does not, it is no great matter there.
Hamlet: Why?
First Clown: It will not be seen in him there. There the men are as mad as he.

Applying the clown’s logic, we see that MySQL’s behaviour, in context, is no great matter.

Range

Different DBMSs have different minimum dates, though they mostly have the same maximum.

                                Minimum                   Maximum
Oracle 12c(timestamp)           4713-01-01 BC Julian      9999-12-31                          
SQL Server 2012 (datetime)      1753-01-01 AD Gregorian   9999-12-31
SQL Server 2012 (datetime2)     0001-01-01 AD Gregorian   9999-12-31
DB2 9 (timestamp)               0001-01-01 AD Gregorian   9999-12-31    
Firebird                        0100-01-01 AD             32768-02-29 
PostgreSQL (timestamp)          4714-11-24 BC Gregorian   294276-12-31
MySQL (datetime)                1000-01-01 AD Gregorian   9999-12-31

At first glance it looks as if MySQL is out of step here — everybody else can store dates before 1000 AD. (SQL Server was out of step too, until Microsoft
introduced the datetime2 data type in 2008.) But actually MySQL can store dates before that. Just don’t try calculating on them, goes the recommendation. However, Oracle, and Oracle alone, switches to the Julian calendar for dates before October 1582. I think that’s becoming the conventional thing to do, even though (sob) it’s not standard SQL. Therefore, yes, MySQL is out of step, but so is everybody else except Oracle.

Post-decimal digits

MySQL version 5.6.4 saw the implementation of WL#946 TIME/TIMESTAMP/DATETIME with fractional seconds. Once again Oracle has hidden the task description but once again it’s visible on wayback (as always, click “High Level Architecture” to see what the real spec is). In this case the implementation does not match the original spec. MySQL only supports 6 post-decimal digitsr (microseconds), and the same is true for MariaDB.

How bad is that? Let’s compare:

                                     Maximum Fraction
DB2 9.7                              .999999999999                 
Oracle 12c                           .999999999
SQL Server 2012                      .99999999
PostgreSQL 9.3                       .999999
MySQL 5.6                            .999999
Firebird                             .9999

I look at that comparison and exclaim “MySQL is behind the Big Three Closed-Source DBMSs”. But I acknowledge that others could look at the same comparison and shrug “Oh well, it’s as good as the other open-source DBMSs”. Merging the opinions, I’ll say: post-decimal digit support is not good but, since 5.6.4, not bizarre.

Yes, But

Yes, but what about the fact that, in MySQL, the date 0000-00-00 is legal but becomes NULL if you retrieve it via ODBC or if you assign NULL to a TIMESTAMP NOT NULL, the result is the current timestamp and other equally odd-looking rules?

Here I can’t defend MySQL on the basis that some other DBMS does it. These are MySQL innovations. And I have never heard somebody from outside MySQL say that they have been helpful in a serious project. I can only say, well, at least they’re avoidable.

The claim is just that MySQL date support is not bizarre, in relative terms. The other DBMSs have quirks too. It’s perhaps inevitable in this imperfect world which refuses to go around the sun in a regular and divisible-by-10 number of days.

pgulutzan, February 16, 2014. Category: MySQL.

MySQL, GROUP BY, Select lists, and Standard SQL

The MySQL manual says: “In standard SQL, a query that includes a GROUP BY clause cannot refer to nonaggregated columns in the select list that are not named in the GROUP BY clause.”

I got questioned about that last week, and this week I see that there’s a proposed change in behaviour for MySQL 5.7, so I guess it’s a good time to try to figure out what’s going on in this part of the world.

I checked the manuals for most current DBMSs and confirmed that the ONLY_FULL_GROUP_BY kind of behaviour is required by DB2, Derby, Firebird, HSQLDB, Informix, Mimer, Oracle, SQL Server, and Teradata. The only exceptions are Sybase and SQLite, and I’ll mention PostgreSQL later.

But returning an error is not really absolutely always the right thing.

Suppose your table was made this way:
CREATE TABLE t (a INT PRIMARY KEY, b INT);
And suppose you make this query:
SELECT a, b FROM t GROUP BY a;
This causes an error if sql_mode = ONLY_FULL_GROUP_BY.
But is it really an error?
The trick here is: if column a is the PRIMARY KEY (or a UNIQUE NOT NULL key), and you’re grouping by a, then it is logically impossible — impossible — that there could be two different values of column b.

One could say that “a determines b”, or that there is a “functional dependency” here. Roland Bouman talked about the matter in 2007 in an article “Debunking GROUP BY myths”. He made the point that, in fact, the current SQL standard does not really absolutely always require that there be an error if there’s a functional dependency.

So let’s look a bit more closely at what the standard says now. Sample:

2) Without Feature T301, “Functional dependencies”, in conforming SQL language, if QCR is a group-invariant column reference, then QCR shall be a reference to a grouping column of the qualifying query of QCR.

Arm-waving our way past the terminology, this kind of thing means that ONLY_FULL_GROUP_BY behaviour is a requirement — for any implementation that doesn’t claim to support Feature T301. Feature T301, eh? That’s not a core feature, in other words no SQL DBMS is required to support T301 in order to claim ANSI/ISO conformance. But it is standard. So one could argue that the MySQL manual is slightly wrong, it shouldn’t suggest that “the standard” requires ONLY_FULL_GROUP_BY. On the other hand, MySQL’s behaviour without ONLY_FULL_GROUP_BY is obviously non-standard, because there’s no checking whether something is “functionally dependent” or not.

What is to be done, then?

Well, it’s rather short notice to say that something will be in 5.7. There were occasional bitter experiences when incompatible behaviour was rushed into MySQL, and I thought the lessons were learned. I was wrong. I should have listened to Mr Kipling:

As it will be in the future, it was at the birth of Man
There are only four things certain since Social Progress began.
That the Dog returns to his Vomit and the Sow returns to her Mire,
And the burnt Fool’s bandaged finger goes wabbling back to the Fire.

Sure, but shouldn’t MySQL do something about that “functional dependency” stuff?

Well, PostgreSQL did. For the sort of item that I used in the example above, which by the way is called a “BPK-dependency” (base table primary key dependency), recent versions of PostgreSQL will say okay. As a result PostgreSQL now claims “partial” compliance with feature T301. But only partial!

The difficulty is that there are so many possibilities, as soon as you start to consider what should happen with views, left or right joins, subsets of multi-column keys, rollup, subqueries, and the many ways that a functional dependency can become known, besides the simple BPK case. Trudy Pelzer and I didn’t really get to grips with these arcana in our book “SQL-99 Complete, Really”, but Jim Melton and Alan Simon did in their book “SQL:1999 Understanding Relational Language Components”, for several dense pages. I’ll just quote the last sentences:

We realize that the material in this section is complex and abstract in the extreme. We have included it in spite of the complexity to serve readers who need to understand SQL:1999’s enhancements in updatability and who are sufficiently involved with data modeling to have or pursue the background in relational theory. However, in practical terms, we have rarely found ourselves using knowledge of all of these functional dependencies when writing our applications.

I think that it will be a long time before PostgreSQL will have full support, and this will cause confusion — people who understand the rationale will expect queries to work whenever they’re rational according to functional-dependency thoughts, and they won’t work. That doesn’t mean, though, that PostgreSQL is wrong — as I’ve said before, complaints about “half-baked features” are typically wrong. I see that the Derby folks are considering following the PostgreSQL lead. And maybe MySQL someday will too. There’s a worklog task, WL#2489, Fully compliant ONLY_FULL_GROUP_BY mode. Once again Oracle has hidden the task, once again it’s visible on wayback, but WL#2489 had no high-level architecture anyway. Well, that means the plan can be flexible.

And the bottom line is: MySQL should change, but no rush. Oracle should recall Saint Augustine’s prayer: “Lord, grant me chastity and continence, but not yet.” A variation could be: Lord, let me support standard SQL by default with Feature T301 … but not in 5.7.”

pgulutzan, January 26, 2014. Category: MySQL.

Roles Review

A role is a bundle of GRANTed privileges which can be assigned to users or which can take the place of users. When there are hundreds or thousands of users, administration becomes a horrible burden if the DBMS doesn’t support roles. Our online ANSI-standard book has syntax descriptions. Other DBMSs do support roles, and MySQL has had a worklog task “WL#988 Roles” for many years. Earlier attempts to implement them included a Google Summer of Code project and a MySQL tool released by Google.

Now another google-summer-coder, Vicențiu Ciorbaru, has put together something which will be in MariaDB. MariaDB’s official blog says this is “a big thing”. I’ll compare the MySQL specification to the MariaDB feature.

Unfortunately Oracle made WL#988 Roles a hidden task about a year and a half ago — you can’t find it by looking on dev.mysql.com/worklog. Fortunately the Wayback Machine had crawled the page in 2012 — you can find it by looking on http://web.archive.org/web/20120508072213/http://forge.mysql.com/worklog/task.php?id=988. Remember that worklog pages are tabbed, and you must click the “High Level Architecture” button to see most of the specification.

I’m using MariaDB 10.0.8 built from source. This is an early version and changes will happen before General Availability happens. Although MariaDB has its own worklog task which I’ll refer to as MDEV-4397, I prefer WL#988 for obvious reasons. For this review I’ll use the same order and same section names as WL#988 high-level architecture. It will be hard to follow unless you read WL#988 first.

CREATE ROLE

“CREATE ROLE role_1;” works. Good.

“CREATE ROLE IF NOT EXISTS role_2;” does not work. I’m ambivalent about this. I don’t like the IF NOT EXISTS clause myself, but I thought it was the way of the future, according to another MySQL Worklog task, WL#3129 Consistent clauses in CREATE and DROP.

Here’s the first bug: I get a crash if I use a bad name:

MariaDB [test]> CREATE ROLE ``;
ERROR 2013 (HY000): Lost connection to MySQL server during query

Notes about role names

MariaDB allows role names to be wildcards or privilege names, for example

MariaDB [test]> CREATE ROLE super;
Query OK, 0 rows affected (0.00 sec)
MariaDB [d]> create role '%';
Query OK, 0 rows affected (0.00 sec)

I regard that as a flaw. Somewhere, ages hence, somebody will get confused because these names have special meanings.

DROP ROLE

“DROP ROLE role_1;” works. Good.

“DROP ROLE IF EXISTS role_1;” does not work. Once again I’m ambivalent. WL#988 specifies that a NOT EXISTS clause should be permissible. On the other hand, “DROP USER IF EXISTS u;” doesn’t work either.

DROP ROLE does not take effect immediately. If a user has enabled role_x, then that user continues to have the privileges of role_x even after role_x is dropped. I regard that as a flaw. [UPDATE: it’s normal, see the comments.]

Here’s the second bug: DROP USER crashes if I specify a role name:

MariaDB [test]> CREATE ROLE role_55;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> DROP USER role_55;
ERROR 2013 (HY000): Lost connection to MySQL server during query

GRANT privilege TO role_name

“GRANT SELECT ON t TO role_1;” works. Good.

“GRANT PROXY ON a TO role_1;” works too. I’m ambivalent yet again. As WL#988 says, it’s meaningless to grant a privilege to a role if it only can affect a user. There will always be an argument between people who think this must mean the user is making a mistake, and people who think it doesn’t do any harm.

The privilege change does not take effect immediately. Consider what WL#988 says:

“The privilege change takes effect for the next statement that any affected user executes. (A user is affected if he/she has an enabled role that is affected by the privilege change.)

That seems reasonable — after all, if I say “GRANT … TO user_x” then user_x gets the privilege immediately. But it doesn’t happen for roles. If I say “GRANT … TO role_x”, then role_x gets the privilege — but users who have enabled role_x do not. I regard this as a flaw. [UPDATE: my test was bad, I was not granting the same type of privilege to both the role and the user, see the comments.]

Here’s the third bug: if a user and a role have the same name, GRANT only works on the role. If I say

CREATE USER u_1;
CREATE ROLE u_1;
GRANT INSERT ON *.* TO u_1;
SELECT host,user,insert_priv,is_role FROM mysql.user WHERE mysql.user.user='u_1';

I’d expect that INSERT would be granted to both the user and the role. It’s not, it’s granted only to the role. Moral: users and roles should not have the same names!

GRANT role_name TO user_name

“GRANT role_1 TO user_1;” works. Good. “GRANT role_1 ON *.* TO user_1;” would be good too but an [ON *.*] clause should never be necessary.

“GRANT role_1 TO user_1 WITH GRANT OPTION;” does not work. I’m ambivalent yet again. What we actually have is an ability to say “GRANT role_1 TO user_1 WITH ADMIN OPTION;” and that’s okay (WITH ADMIN OPTION is a standard clause), but I’m not sure it’s a good idea that it’s the default.

“GRANT role_1, role_2 to user_1;” does not work. I regard this as a flaw, but it might have been tough to implement. (UPDATE: 2016-03-20: It is indeed tough to implement. Two months after this blog comment was made, since the manual said that it would work, this flaw was reported as a bug: MDEV-5772 Granting multiple roles in single statement does not work. But two years later it was still unresolved.)

Who has GRANT role_name privileges?

As WL#988 puts it:

Some People will say that, if Peter said CREATE ROLE Role1, Peter should automatically have GRANT Role1 privilege. The analogy is with the way that we grant for routines. This is what would happen with Oracle.
Other People will say that, no, the only way for Peter to get GRANT Role1 privileges is if somebody grants to Peter. The analogy is with the way that we grant for tables.

The implementer decided to follow the advice of Some People for this one. Good.

GRANT role_name TO role_name

“GRANT role_1 TO role_2;” works. Good.

“GRANT role_1 TO role_1;” does not work — and that’s good too. This is a special instance of what WL#988 calls a “cyclical GRANT”. Nobody would want that.

GRANT CREATE / ALTER / DROP ROLE

“GRANT CREATE ROLE …” does not work. I regard that as a flaw.

WL#988 has quite a bit of verbiage about how there has to be a CREATE ROLE privilege, and why. MariaDB sidesteps — the decision was that all you need is a CREATE USER privilege. Although I cannot think of a way that this could lead to security breaches, I am fairly sure that cases exist where administrators who want to allow user-creation do not want to allow role-creation, and vice versa.

REVOKE

“REVOKE role_1 FROM user_1;” works. Good.

SET ROLE

“SET ROLE role_1;” works. And it only works if somebody previously said “GRANT role_1 TO user_1;” for the user that’s doing the SET ROLE. Good.

“SET role ‘role_1′;” works, that is, the role name can be a string literal. But “SET @x=’role_1’; SET ROLE @x;” does not work. I regard that as a flaw because it’s a lack-of-orthogonality thing. By that I mean: if a string literal is okay in a certain place in an SQL statement, then The User On The Clapham Omnibus will expect that a string variable would be okay in the same place.

“SET ROLE role_1,role_2;” does not work. That’s because MariaDB, unlike Oracle, doesn’t allow assignment of multiple roles. And that’s what WL#988 asks for, so no complaints here.

SET ROLE DEFAULT does not work. I regard that as a flaw. Yes, I recognize there is no non-null default, but that just means that SET ROLE DEFAULT should have the same effect as SET ROLE NONE.

Big Example

The big example didn’t work due to incompatible syntax, but I see that there is some evidence that testing has been done. Good.

CURRENT_ROLE

“SELECT CURRENT_ROLE;” works. Good. As expected, it’s possible to find out what the last “SET ROLE” statement did.

“SELECT CURRENT_ROLE();” also works. I regard this as a flaw but I suppose it was inevitable. After all, “SELECT CURRENT_USER();” works.

CURRENT_ROLE is not on the current reserved words list and this will cause a minor incompatibility with MySQL.

ALTER ROLE

“ALTER ROLE role_1 …” does not work. Good.

ALTER ROLE exists in Oracle for changing the IDENTIFIED BY clause, but MariaDB doesn’t support the IDENTIFIED BY clause, so ALTER ROLE would have nothing to do.

RENAME ROLE

“RENAME ROLE role_1 TO role_2;” does not work. Good.

Although “RENAME USER user_1 TO user_2;” works, WL#988 reminds us that renaming a role is going to have cascading effects that wouldn’t exist for renaming a user.

SHOW PRIVILEGES

“SHOW PRIVILEGES;” is the same as it used to be. Good.

Since MariaDB doesn’t have a CREATE ROLE privilege, it doesn’t have to list it.

SET DEFAULT ROLE

“SET DEFAULT ROLE role_1 TO user_1;” does not work. I regard this as a flaw.

It’s a good thing if, when user_1 connects, user_1 automatically gets the privileges associated with a default role. It’s less good if user_1 has to do her own SET ROLE whenever she connects, probably with mysql –init_command=’SET ROLE role_1′. The point of roles was supposed to be the elimination of administrator hassle, and –init-command won’t help there.

The Initial State

The initial state when a user connects is, effectively, SET ROLE NONE. That’ll have to do for now.

SHOW

“SHOW GRANTS FOR user_1;” works and shows roles too. Good.

“SHOW GRANTS FOR CURRENT_ROLE;” also works. WL#988 failed to suggest this obvious addition, so maybe I should call it: better than good.

INFORMATION_SCHEMA

“SELECT * FROM INFORMATION_SCHEMA.ENABLED_ROLES;” works. Good.

“SELECT * FROM INFORMATION_SCHEMA.APPLICABLE_ROLES;” also works. Also good.

And it does show contained roles. That is, “If role1 is contained in role2, then both roles appear.” However: It only shows the roles that have been granted for the current user, it does not show all roles. I regard that as a flaw.
[UPDATE: In an earlier version of this blog this observation was in the wrong place.] [UPDATE again: I missed what the cause was, the real problem was that I couldn’t grant to public — but that’s another bug. See comments.]

Command line options

“–role=role_name” doesn’t work (on the command line). Good.

The way to go for setting a role name at connect time is to support defaults. Any other “solution” would just get in the way.

For Replication

As WL#988 says, “The plan is to ignore replication till the last minute.” I can’t tell whether MariaDB is following such a plan.

Plan for Backup

Backup is simplified by the fact that users and roles come from the same table in the mysql database. I did not test it, so let’s say: assumed good.

Stored Procedures and Definer Privileges

After “SET ROLE role_1; CREATE PROCEDURE p () SQL SECURITY DEFINER SELECT CURRENT_ROLE; CALL p();” the result is NULL. Good.

The idea is supposed to be that, within an SQL SECURITY DEFINER routine, an implicit SET ROLE NONE takes place temporarily. It’s great that the implementer caught on to this rather obscure point.

Logging in with a Role Name

After “CREATE ROLE ‘role_5@localhost’;” or “CREATE ROLE role_5;” I was not able to log in as role_5. Good.

However, “CREATE DEFINER=role_5 PROCEDURE p2() SET @x=5;” works. I regard that as a flaw. There should at least be a warning that the DEFINER clause is being used with a nonexistent user. I get the impression from MDEV-4397 that this is deliberate behaviour, so don’t expect a fix.

DML statements and role privileges

The Oracle-style restriction for CREATE VIEW isn’t there. Good.

Other worklog tasks affected by roles

Well, WL#988 anticipated that roles would come first, and pluggable authentication support would come later. It didn’t happen that way. No problem.

Glossary

It doesn’t seem that the terminology differs from what’s expected. Good.

New columns in mysql.user

When I say “CREATE ROLE role_1;” the effect is that a new row goes into the mysql.user table. The mysql.user table definition had to be adjusted to take that into account. But I don’t think it’s all good.

WL#988 expected that the length of a role name would be 16 characters, but mysql.user.user is now CHAR(80). Well, that was a flaw, but just an old bug that wasn’t caused by introduction of roles.

The mysql.user.host column is blank (”). I’d expected it would be ‘%’. I suppose that this was intended as a way to disambiguate users and roles that had the same names, but that’s not working well anyway. The column should be ‘%’ because that means “any or all hosts”.

There is one new column at the end, named is_role, which can be ‘Y’ or ‘N’. I regard that as a flaw. The future-considerations way would have been to add a new column named user_type, which could be NULL or ‘Role’ — or something else that somebody will dream up in a future release. The way it is now, if somebody dreams up another type later, they’ll have to add yet another column in mysql.user.

Storing role names

The implementer picked “Alternative #2: (loosely based on a suggestion from Monty Widenius)”. So there’s a new table in the mysql database:
CREATE TABLE mysql.roles_mapping (Host CHAR(60), User CHAR(80), Role CHAR(80), Admin_option ENUM(‘N’,’Y’))
… And it works. Good. Although ROLE was a reserved word in SQL-99, and although WL#988 suggested an additional column and different names, it works.
However, storing the WITH ADMIN OPTION value should only be in one place and that place is mysql.user, so mysql.roles.mapping should not contain a column named Admin_option. I regard this as a flaw. [UPDATE: not it is not a flaw, see the comments.]

The Decisions about Options

The early decisions of Monty Widenius were: we don’t want passwords, we do want defaults, we don’t want multiple current roles, and PUBLIC should not be a role. That’s what ended up happening, except that there are no defaults. Good.

Overall

For all of the bugs, and for some of the flaws, there’s no worry — they’ll probably disappear. The MariaDB advantage here is the release-early policy, because there’s lots of advance time if there’s a decision to make a few corrections or course changes.

The MariaDB disadvantage is illustrated by its worklog task for roles, MDEV-4397. I’m biased, but I think it’s fair to say that MySQL’s WL#988 “high level architecture” specification covered the necessary ground and MariaDB’s MDEV-4397 did not. This means Oracle made the right decision when it “hid” its worklog task. Too bad it didn’t work out this particular time.

pgulutzan, January 12, 2014. Category: MySQL / MariaDB.

Mroonga and me and MariaDB

Chinese and Japanese and Korean (CJK) text usually has no spaces between words. Conventional full-text search does its tokenizing by looking for spaces. Therefore conventional full-text search will fail for CJK.

One workaround is bigrams. Suppose the text is

册免从冘

There should be three index keys, one for each two-character sequence:

册免, 免从, and 从冘.

Now, in a search like

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('免从');

a bigram-supporting full-text index will have a chance. It’s wasteful and there will be false hits whenever the bigram isn’t really a “word”, but the folks in CJK-land have found that bigrams (or the three-character counterpart, trigrams) actually work.

One way to get bigrams for MySQL or MariaDB is to get mroonga.

Why care about Yet Another Storage Engine)?

Back in 2008 a project named Senna attracted the attention of my colleagues at MySQL, but didn’t go on to world conquest.

Since around 2011 the groonga project, billed as a “successor” to Senna, has been outing regular releases of a generic library that can connect to more than one DBMS. “Mroonga” is the “M[ySQL storage engine that interfaces with the] groonga [code]”. So, although it hasn’t been packaged with MySQL or MariaDB until now, it’s not all new or untested code. What’s new is that MariaDB will, real soon now, include mroonga as part of the regular MariaDB download. Poof, credibility at last.

I understand that mroonga has features that make it interesting:
* It’s a column-storage engine, which I suppose makes it an alternative to Infobright or InfiniDB
* Some of its users or developers also have an involvement with the Spider storage engine, which I suppose means there wouldn’t be too much trouble using the two engines in concert: mroonga for full-text and Spider for sharding.
… but I didn’t look at the column storage or the Spider compatibility. I cared only that SHOW ENGINES said it’s “CJK-ready fulltext search”.

Of course, SHOW ENGINES can also show MyISAM and InnoDB and SphinxSE, and they can all do full-text searching too. One might choose MyISAM because for a long time that was the only engine that had full-text, or choose InnoDB because it’s the default nowadays, or choose SphinxSE because it has lots of features including stemming. But those aren’t targeted for the CJK niche. For example, MariaDB has no bigrams. Sphinx’s documentation says it does have bigrams, but a glance shows instantly “those are not the bigrams you’re looking for”.

The only questions, then, are (a) does mroonga really handle full-text, and (b) does mroonga really handle CJK?

Does mroonga work?

At the time I’m writing this, MariaDB 10.0.8 doesn’t have mroonga yet. There are instructions for getting packages with Windows and Ubuntu etc., but such things never work for me — they’ll always be out of synch with something else that I’ve got, or with what the main vendor (Oracle or MariaDB in this case) is updating. So I decided to build from source. Getting mroonga was easy since I already have Bazaar:

bzr branch lp:~mroonga/maria/mroonga

and then build. Actually the initial build failed because mroonga required CMake 2.8.8 which doesn’t come with Ubuntu 12.04. But after I worked around that, I had a server and all I had to say was

                                                                             
INSTALL PLUGIN mroonga SONAME 'ha_mroonga.so';
CREATE FUNCTION last_insert_grn_id RETURNS INTEGER SONAME 'ha_mroonga.so';
SET GLOBAL mroonga_log_level=NONE;

I found more out by looking at some documentation, which is nowhere near Oracle quality but is mostly in readable English.

Then I was able to do this:

CREATE TABLE t (
  id INT NOT NULL DEFAULT '0',
  `text_column` VARCHAR(100) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (id),
  FULLTEXT KEY `text_column` (text_column)
) ENGINE=mroonga DEFAULT CHARSET=latin1 COMMENT='engine "innodb"'

It’s probably less than obvious that I’m looking at a good-looking feature. The point is: with this syntax I can use mroonga with an InnoDB table. Whoopie. Although that means I lose mroonga’s other features, I can use it without losing whatever I find good about InnoDB. And then I was able to do this:

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('x');

Again it’s probably less than obvious that I’m looking at a good-looking feature. The MATCH … AGAINST syntax is idiosyncratic and unobvious — don’t expect this to arrive in an SQL standard near you any time soon. But it’s what the long-term MyISAM user is used to, so the transition isn’t painful.

Does mroonga handle CJK?

The J (Japanese) part is certainly there. The developers are Japanese. One of the optional extras is integration with MeCab which is a tool that can handle Japanese morphology — it’s like solving the “no spaces” problem by understanding a bit about the Japanese language, which after all is the solution that a human would use with no-spaces text. I don’t see, though, that there’s been equivalent attention paid for the C (Chinese) and K (Korean) parts of CJK. So I just looked at the bigrams, since they’re all that C or K could benefit from.

I created a million-row table containing randomly-chosen characters, mostly Kanji, but with a mix of Latin and Kana letters. How I made it is tedious, but perhaps somebody out there will want to know how to make randomly populated tables with such “data”, so I’ll put the code in a comment on this posting.

I had a choice between a bewildering variety of ways to tokenize. The default one, TokenBigram, did the job. The good news for me was that it didn’t do bigrams when the text was entirely Latin — that would be silly — but in a separate test I did see that it indexed Latin characters as ordinary words. So one of the worries that I had — that mroonga would be only good for CJK but not for non-CJK mixed inside the same text column — turned out not to be a worry.

And now, the essential point, the “l’essence du bigram” (which would look good on a restaurant menu, eh?) …

MariaDB [test]> SELECT COUNT(*) FROM t WHERE MATCH(text_column) AGAINST ('免从');
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

It found a row with a bigram! But is that the right count?

MariaDB [test]> SELECT COUNT(*) FROM t WHERE text_column LIKE '%免从%';
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (2.48 sec)

Yes, it’s the right count — a LIKE search confirms it.

The difference is that the LIKE search took 2.5 seconds because it had to scan a million rows. The mroonga full-text search took approximately zero seconds because it used an index. That doesn’t mean that mroonga is fast — not every search is a benchmark. I did find a graph that shows mroonga throughput is better than MyISAM’s or SphinxSE’s but probably if I looked hard I could find a graph that says the opposite.

And the crowd goes wild

Although I have done no benchmarks or bug hunts, I have acquired the impression that mroonga is capable of doing what its makers say it ought to be able to do.

pgulutzan, January 1, 2014. Category: MariaDB.

The USA’s healthcare.gov site and LAMP

The USA’s health care exchange site, healthcare.gov, has had well-publicized initial woes.

The New York Times has said one of the problems was the government’s choice of DBMS, namely MarkLogic. A MarkLogic employee has said that “If the exact same processes and analysis were applied to a LAMP stack or an Oracle Exa-stack, the results would have likely been the same.”

I don’t know why he picked Exastack for comparison, but I too have wondered whether things would have been different if the American government had chosen a LAMP component (MySQL or MariaDB) as a DBMS, instead of MarkLogic.

What is MarkLogic?

The company is a software firm founded in 2001 based in San Carlos California. It has 250 employees. The Gartner Magic Quadrant classes it as a “niche player” in the Operational DBMS Category.

The product is a closed-source XML DBMS. The minimum price for a perpetual enterprise license is $32,000 but presumably one would also pay for support, just as one does with MySQL or MariaDB.

There are 250 customers. According to the Wall Street Journal “most of its sales come from dislodging Oracle Corp.”

One of the customers, since 2012 or before, is CMS (the Centers for Medicare and Medicaid), which is a branch of the United States Department of Health and Human Services. CMS is the agency that built the healthcare.gov online portal.

Is MarkLogic responsible for the woes?

Probably MarkLogic is not the bottleneck.

It’s not even the only DBMS that the application queries. There is certainly some contact with other repositories during a get-acquainted process, including Oracle Enterprise Identity management — so one could just as easily blame Oracle.

There are multiple other vendors. USA Today mentions Equifax, Serco, Optum/QSSI, and the main contractor
CGI Federal.

A particular focus for critics has been a web-hosting provider, Verizon Terremark. They have been blamed for some of the difficulties and will eventually be replaced by an HP solution. HP also has a fairly new contract for handling the replication.

Doubtless all the parties would like to blame the other parties, but “the Obama administration has requested that all government officials and contractors involved keep their work confidential”.

It’s clear, though, that the site was launched with insufficient hardware. Originally it was sharing machines with other government services. That’s changed. Now it has dedicated machines.

But the site cost $630 million so one has to suppose they had money to buy hardware in the first place. That suggests that something must have gone awry with the planning, and so it’s credible what a Forbes article is saying, that the government broke every rule of project management.

So we can’t be sure because of the government confidentiality requirement, but it seems unlikely that MarkLogic will get the blame when the dust settles.

Is MarkLogic actually fast?

One way to show that MarkLogic isn’t responsible for slowness, would be to look for independent confirmations of its fastness. The problem with that is MarkLogic’s evaluator-license agreement, from which I quote:

…
MarkLogic grants to You a limited, non-transferable, non-exclusive, internal use license in the United States of America
…
[You must not] disclose, without MarkLogic’s prior written consent, performance or capacity statistics or the results of any benchmark test performed on Software
…
[You must not] use the Product for production activity,
…
You acknowledge that the Software may electronically transmit to MarkLogic summary data relating to use of the Software

— http://developer.marklogic.com/products

These conditions aren’t unheard of in the EULA world, but they do have the effect that I can’t look at the product at all (I’m not in the United States), and others can look at the product but can’t say what they find wrong with it.

So it doesn’t really matter that Facebook got 13 million transactions/second in 2011, or that the HandlerSocket extension for MySQL got 750,000 transactions/second with a lot less hardware. Possibly MarkLogic could do better. And I think we can dismiss the newspaper account that MarkLogic “continued to perform below expectations, according to one person who works in the command center.” Anonymous accounts don’t count.

So we can’t be sure because of the MarkLogic confidentiality requirements, but it seems possible that MarkLogic could outperform its SQL competitors.

Is MarkLogic responsible for absence of High Availability?

High Availability shouldn’t be an issue.

At first glance the reported uptime of the site — 43% initially, 90% now — looks bad. After all, Yves Trudeau surveyed MySQL High Availability solutions years ago and found even the laggards were doing 98%. Later the OpenQuery folks reported that some customers find “five nines” (99.999%) is too fussily precise so let’s just round it to a hundred.

At second glance, though, the reported uptime of the site is okay.

First: The product only has to work in 36 American states and Hawaii is not one of them. That’s only five time zones, then. So it can go down a few hours per night for scheduled maintenance. And uptime “exclusive of scheduled maintenance” is actually 95%.

Second: It’s okay to have debugging code and extra monitoring going on during the first few months. I’m not saying that’s what’s happening — indeed the fact that they didn’t do a 500-simulated-sites test until late September suggests they aren’t worry warts — but it is what others would have done, and therefore others would also be below 99% at this stage of the game.

So, without saying that 90 is the new 99, I think we can admit that it wouldn’t really be fair to make a big deal about some LAMP installation that has higher availability than healthcare.gov.

Is it hard to use?

MarkLogic is an XML DBMS. So its principal query language is XQuery, although there’s a section in the manual about how you could use SQL in a limited way.

Well, of course, to me and to most readers of this blog, XQuery is murky gibberish and SQL is kindergartenly obvious. But we have to suppose that there are XML experts who would find the opposite.

What, then, can we make out of the New York Times’s principal finding about the DBMS? It says:

“Another sore point was the Medicare agency’s decision to use database software, from a company called MarkLogic, that managed the data differently from systems by companies like IBM, Microsoft and Oracle. CGI officials argued that it would slow work because it was too unfamiliar. Government officials disagreed, and its configuration remains a serious problem.”

— New York Times November 23 2013

Well, of course, to me and to most readers of this blog, the CGI officials were right because it really is unfamiliar — they obviously had people with experience in IBM DB2, Microsoft SQL Server, or Oracle (either Oracle 12c or Oracle MySQL). But we have to suppose that there are XML experts who would find the opposite.

And, though I think it’s a bit extreme, we have to allow that it’s possible the problems were due to sabotage by Oracle DBAs.

Yet again, it’s impossible to prove that MarkLogic is at fault, because we’re all starting off with biases.

Did the problem have something to do with IDs?

I suspect there was an issue with IDs (identifications).

It starts off with this observation of a MarkLogic feature: “Instead of storing strings as sequences of characters, each string gets stored as a sequence of numeric token IDs. The original string can be reconstructed using the dictionary as a lookup table.”

It ends with this observation from an email written on September 27 2013 by a healthcare.gov worker: “The generation of identifiers within MarkLogic was inefficient. This was fixed and verified as part of the 500 user test.”

Of course that’s nice to see it was fixed, but isn’t it disturbing that a major structural piece was inefficient as late as September?

Hard to say. Too little detail. So the search for a smoking gun has so far led nowhere.

Is it less reliable?

Various stories — though none from the principals — suggest that MarkLogic was chosen because of its flexibility. Uh-oh.

The reported quality problems are “one in 10 enrollments through HealthCare.gov aren’t accurately being transmitted” and “duplicate files, lack of a file or a file with mistaken data, such as a child being listed as a spouse.”

I don’t see how the spousal problem could have been technical, but the duplications and the gone-missings point to: uh-oh, lack of strong rules about what can go in. And of course strong rules are something that the “relational” fuddy-duddies have worried about for decades. If the selling point of MarkLogic is in fact leading to a situation which is less than acceptable, then we have found a flaw at last. In fact it would suggest that the main complaints so far have been trivia.

This is the only matter that I think looks significant at this stage.

How’s that hopey-changey stuff working out for your Database?

The expectation of an Obama aide was: “a consumer experience unmatched by anything in government, but also in the private sector.”

The result is: so far not a failure, and nothing that shows that MarkLogic will be primarily responsible if it is a failure.

However: most of the defence is along the lines of “we can’t be sure”. That cuts both ways — nobody can say it’s “likely” that LAMP would have been just as bad.

pgulutzan, December 8, 2013. Category: MySQL / MariaDB, NoSQL.

Tuples

“It is better to keep silence and be thought a fool, than to say ‘tuple’ and remove all doubt.”

But recently people have been using the word “tuple” more frequently. Doubtless all those people know that in relational databases a tuple (or a tuple value) is a formal term for a row of a table. It’s possible to know a bit more than that.

Pronounced Tyoople, Toople, or Tuhple?

The Oxford Dictionaries site says Tyoople. Other dictionaries are neutral about the terms from which Tuple was derived (sextuple, octuple, etc.), for example Merriam-Webster says they usually end in Toople but Tuhple is an accepted alternate, and the Oxford Canadian Dictionary says it’s always Tuhple. So the question comes down to: what is the proper way in a database context?

I found one book that says Tuple rhymes with Scruple, that is, it’s Toople: Rod Stephens, Beginning Database Design Solutions. But Mr Stephens also tells us that tables are called relations because “the values of a row are related”, so I’m wary about him.

I found four books that say Tuple rhymes with Couple, that is, it’s Tuhple:

David Kroenke, Database Processing: Fundamentals, Design and Implementation
Kevin Loney, Oracle9i The Complete Reference
Donald Burleson, Oracle High-Performance SQL Tuning
Paul Nielsen, SQL Server 2005 Bible

Then I found the decisive one:
C.J.Date, An Introduction To Database Systems.
I quote:

The relational model therefore does not use the term “record” at all; instead it uses the term “tuple” (rhymes with “couple”).

Since C.J.Date had many conversations with the The Founder (E.F.Codd), and Mr Codd would have been certain to correct Mr Date if he had mispronounced, this is decisive. Most writers in the field, including the one who ought to know, are saying that it rhymes with couple.

Wait a minute — couple was originally a French word, and the French would use an oo sound, what about that? A good explanation, although it’s based on analogy, is that some Middle English words with oo (like blood and flood) changed in stages from the oo sound to the uh sound in the centuries following the Great English Vowel Shift. See the Wikipedia article about Phonological history of English high back vowels. So the reply to people who say “etymologically it was an oo sound” would be “yes, but oo changed to uh as part of a trend, get modern”.

But what if it’s a non-relational tuple?

Quoting C.J.Date again (from The Relational Database Dictionary):

NOTE: Tuples as defined in the relational model differ in certain respects from the mathematical construct of the same name. In particular, tuples in mathematics typically don’t have named attributes; instead, their attributes are identified by their ordinal position, left to right.

Aha. So if a sequence of values doesn’t have a corresponding header with a sequence of column names, it shouldn’t be called a row (that would be relational) but it could be called a tuple — provided it’s not in a relational database. In practice that’s seems to be a fairly common usage, but I’ll highlight the products where it seems to be the preferred usage.

Tuple Spaces.
The modest idea of just filling a space with tuples, and calling it a
“tuple space”, started off in 1982 with a language named Linda. Since then the general concept has gotten into various Java implementations.
Python.
The tuple is a supported data type that’s part of the Python core.
Of course there are similar things in other languages but I believe
that Python is the most prominent language that actually calls it a
tuple.
Pig.
FoundationDB.
Tarantool.
Actually one of my current projects is enhancing Tarantool’s documentation, which is what led me to wonder about the word.

The MySQL manual usually avoids the term, although it’s more frequent with NDB.

Alas

Recently I saw a poll with a single question:

What is to be Done? (a) Nothing (b) Something

I think that (a) won, hurrah. And yet it would have been a finer world if everyone had agreed that “tuple” meant a sequence of values, “record” meant a sequence of values which had fixed types, and “row” meant a sequence of values which had both fixed types and fixed names. If only Mr Codd had left the vocabulary alone …

pgulutzan, December 2, 2013. Category: Standard SQL.

Kleidophobia

The old Greek word for “key” was KLEIS so someone who fears keys would be a kleidophobe, and a xenokleidophobe specifically fears foreign keys, which brings me to the MySQL 3.23 manual …

5.4.5.1 Reasons NOT to Use Foreign Keys constraints
There are so many problems with foreign key constraints that we don’t know where to start:
Foreign key constraints make life very complicated …,
The speed impact is terrible …,
There is also a need to hold locks on many more tables …,
And so on.
And so on.

Of course those fierce words quietly disappeared from the manual some time after InnoDB came out with foreign key support in 2001. But since then the attempts to get foreign key support outside of InnoDB have usually gotten stuck.

* SolidDB had it, and I made a big deal out of that while predicting a bright future for the SolidDB storage engine.
* Falcon had it, although there were bugs in the implementation (they’re marked “to be fixed later”).
* PBXT had it
* IBMDB2I had it.

All of those storage engines are, for one reason or another, gone. So, is any independent storage engine left that supports foreign keys? I know of only one: ScaleDB. ScaleDB is targeted for clusters, and I’m not sure how much traction its makers are getting in that market, but I’m always joyous when I see a company paying attention to the standard feature requirements.

Anyway, to finish our look at MySQL documentation on the matter, I’ll quote once more:

“Foreign key support generic for all storage engines is scheduled for MySQL 6.x.”

This ancient promise is still in the current manual but I predict it too will silently disappear. The promise originally was made because there was progress on a project to do a lot for foreign keys, and the detailed plan is still publicly available on the mysql.com page for Worklog Item #148 (click the “High Level Architecture” button to view the important part). Though the project got put on the shelf, the worklog document is still good for seeing what, at the time, everyone agreed needed work.

The document’s “Feature Comparison Chart” compared InnoDB feature details with the ISO standard, SQL Server, DB2, Oracle, and PostgreSQL. It’s a long list. I’ll just pick the five standard “core” items where every vendor follows the standard. Except InnoDB.

1. References Primary/Unique

InnoDB lets foreign keys refer to keys that are not primary or unique.
For example:

CREATE TABLE pk (s1 CHAR, KEY (s1), s2 INT);
CREATE TABLE fk (s1 CHAR, FOREIGN KEY (s1) REFERENCES pk(s1));
INSERT INTO pk VALUES ('a',1),('a',2);
INSERT INTO fk VALUES ('a');

So now there’s a foreign key that points to two different parent-key rows. Guess what happens if you delete only one of the referenced-key rows, thus:

DELETE FROM pk WHERE s2 = 2;

If you guessed “that’s illegal because you can’t delete a row if there’s a foreign-key referring to it”, congratulations.
If you guessed “that’s legal because the foreign key is still referencing a valid undeleted row”, congratulations.
Either answer is logical, and it doesn’t matter which one is right.
What matters is that neither logic nor precedent are of any help when InnoDB goes boldly beyond both theory and common practice.

2. Default Primary Reference

It ought to be possible to just say that the foreign key references table PK, and not have to list the columns — if PK has a primary key, the DBMS can figure it out. But, partially as a result of the previous problem, InnoDB doesn’t figure it out.

3. Statement Check

Sometimes a check for a foreign-key violation should occur when the statement is truly over, when all other changes (due to triggers or cascading updates) have finished, so that the check isn’t affected by some temporary situation that would be cleared up before the statement ended.

4. Column REFERENCES

This looks okay, and there’s no error or warning:

CREATE TABLE pk (s1 INT);
CREATE TABLE fk (s1 INT REFERENCES pk(s1));

… but, woe to the uninitiated, it doesn’t make a foreign key. You can’t make a foreign key by putting a REFRENCES clause in the column definition.

5. REFERENCES Privilege

I suppose that newcomers might also try this:

GRANT REFERENCES ON t TO joe;

and think that they’re affecting security for foreign keys. In fact they’re creating a privilege that does nothing.

The current situation is less than perfect, and I cast the blame on all the closet xenokleidophobes. You know who you are.

pgulutzan, November 12, 2013. Category: MySQL.

TokuDB Features

Many people have tested whether the TokuDB storage engine runs faster and compresses better than the default storage engine, InnoDB. I am more concerned about TokuDB’s features. Or, to put it unfairly: can it do everything that InnoDB can do, in a current version?

Vadim Tkachenko of Percona supplied a binary download for a platform that I happen to have (Ubuntu 12.04), so I got MySQL 5.6 + TokuDB 7.1 up and running within 15 minutes. Then I found that I could not break anything within 60 minutes. I I conclude that “ease of use” and “stability” are okay. (For anyone who thinks I’m rushing: yes, but usually finding a bug in a new version takes less than an hour so I’m judging from experience.)

Almost everything = No problem

I tried compound indexes, long strings, the relatively-new TIMESTAMP(6) data type with fractional seconds, utf8mb4 varchar columns with German collation, transaction rollbacks with or without savepoints, crash recovery, and partitions.

I always got the same results as I get with InnoDB. It seems that the Tokutek folks looked at compatibility before declaring that their product is a “drop-in replacement storage engine for MySQL and MariaDB”.

Foreign keys

These statements didn’t cause an error but the FOREIGN KEY clause was ignored:

CREATE TABLE tp (s1 INT, PRIMARY KEY (s1)) engine=tokudb;
CREATE TABLE tf (s1 INT, FOREIGN KEY (s1) REFERENCES tp (s1)) engine=tokudb;

If I use InnoDB, the statements succeed. In basic English, TokuDB doesn’t support referential integrity.

This is a big deal because foreign keys are part of “core” standard SQL. In mitigation, I have to allow that many users eschew foreign keys in this benighted sharded world, and that in theory one could replace some foreign-key functionality with triggers. But I doubt that anybody ever does.

Full text

These statements resulted in an error:

CREATE TABLE tq (s1 TEXT) engine=tokudb;
CREATE FULLTEXT INDEX iq ON tq (s1);

If I use InnoDB, the statements succeed, because InnoDB (since version 5.6.4) supports full-text indexing, TokuDB doesn’t.

I’m sure I could hear objections like “that’s not what TokuDB is for” (true), “InnoDB’s implementation is bad” (actually that’s my own opinion but I’m sure somebody somewhere shares it), or “use Sphinx” (I don’t understand the logic here but I heard it when I complained about a full-text deficiency in another product).

To refute all objections, I went to bugs.mysql.com, clicked “Return only bugs with status = ALL”, clicked “Show = ALL”, clicked “Restrict to bugs in selected categories = MySQL Server: FULLTEXT search”, and clicked “Search”. I got 107 hits So full text must be important to somebody, else they wouldn’t report bugs.

Isolation

I started two connections, both with autocommit = 0 and default transaction isolation = REPEATABLE READ. On the first connection I said

CREATE TABLE t2 (s1 INT, UNIQUE(s1)) engine=tokudb;
INSERT INTO t2 VALUES (0);

On the second connection I said:

INSERT INTO t2 VALUES (1);

Result: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction.
If I use engine=innodb, I do not see this error, both INSERTs work.

I also see that there’s a documented behaviour difference with SELECT FOR UPDATE.

I conclude that TokuDB’s different locking methods mean that some SQL statements will have different effects in multi-user environments, if one switches from InnoDB.

Names

A storage engine maker cannot muck with MySQL’s syntax, with two exceptions: (1) this rule doesn’t apply to InnoDB; (2) new variables or information_schema tables can be added that are specific to tuning or monitoring.

So I said “SHOW VARIABLES LIKE ‘tokudb%'” … and I did see some things that bothered me. For example:
() tokudb_lock_timeout seems to be something like innodb_lock_wait_timeout so it’s unfortunate that they didn’t call it tokudb_lock_wait_timeout
() tokudb_commit_sync seems to do some things like innodb_flush_log_at_trx_commit so it’s unfortunate that they didn’t use the same numbers
() tokudb_fs_reserve_percent? well, I couldn’t tell if it’s unfortunate or not because I couldn’t find it in the TokuDB user manual.

Platforms

Tokutek’s FAQ says that TokuDB was tested on Centos 5 and CentOS 6. Doubtless it also works on all other Linux 64-bit distros (I downloaded for Ubuntu 12.04 and the MariaDB manual mentions Fedora). I see that Antony Curtis got it going on his Macbook.

That leaves three major items from the list of platforms that MySQL supports: FreeBSD, Solaris, and Windows. The trouble is that one of them, Windows, is what beginners or students might start with. As a result, MySQL + InnoDB or MariaDB + InnoDB will remain the default choice for introductory texts and some personal uses.

I can’t see how this will change, since there’s no money in the entry-level market and there’s no point in a “Windows server” for Tokutek’s real objective, which is to capture a piece of the must-have-performance market. But if they don’t care, the mind share won’t be there.

You’re forearmed

I fear that, by concentrating on TokuDB’s performance advantages, people have failed to worry about losing features that are available with InnoDB. Even if you conclude that you want to use TokuDB — especially if you conclude that you want to use TokuDB — you should be aware of them.

pgulutzan, November 4, 2013. Category: MySQL / MariaDB.

The UTF-8 world is not enough

In English there are very few words of Japanese origin but I think this one has a great future: mojibake. Mojibake is the garbage you see when MySQL or MariaDB has a column definition saying character set A, stores into it a string that’s actually in character set B, then ships it to a client which expects everything to be in character set C.

For some DBMSs (Drizzle and NuoDB spring to mind) it’s apparent that the developers decided that users could avoid mojibake, and developers could avoid a lot of extra trouble, if everybody had the same character set: UTF-8. Well, MySQL and MariaDB have web users and UTF-8 is popular on the web. According to Web Technology Surveys, 80% of web sites use UTF-8. There’s a Google study that came to a vaguely similar conclusion.

And yet, and yet, the arguments are devastating for why should MySQL or MariaDB continue to support character sets other than UTF-8, and even add one or two more.

Space

With a Greek character set like 8859-7, it takes one (1) byte to store one ordinary Greek character. With UTF-8 it takes two (2) bytes to store one ordinary Greek character. So 8859-7 is 50% cheaper for Greek words, and the same ratio holds for other 8-bit non-Latin language-specific character sets like Bulgarian.

With a multi-byte character set like euckr (for Korean) things get more complicated and results will vary. I took the lead paragraph for the Korean Wikipedia article on Seoul, dumped the contents into columns with different character set definitions, and compared the sizes with the octet_length() function:

utf8     1260              /* worst */
utf16    1138              /* middle */
euckr     916              /* best */

Unsurprisingly, it takes fewer bytes to store Korean text with a “made-in-Korea” character set. Also it’s a bit cheaper to use utf16 (which always requires two bytes per character for ordinary Korean words), which is doubtless one reason that the SQL Server folks prefer UTF-16 over UTF-8. I of course will ignore any objections that space doesn’t matter, or that compression algorithms exist.

Conversion trouble

It is true that Unicode’s repertoire is a superset of what’s in Japanese character sets like sjis or ujis. But that does not mean that you can start with sjis text, convert to Unicode, then convert back to sjis. On some rare occasions such a “round trip” will fail. This happens because the mapping is not always one-to-one. But it also happens because there are slight differences between various versions of sjis — that’s a fact which causes no difficulty if there’s no conversion, but causes some difficulty if something comes in from the client as sjis, gets converted to Unicode for storage, then gets converted back to sjis when the server retrieves something for the client.

Legacy

At the start of this post I mentioned that 80% of web sites use UTF-8. However, that is the world-wide figure. For a language-specific look, here are two sobering statistics:

* Only 54% of Chinese-language web sites predominantly use UTF-8
* Only 60% of Japanese-language web sites predominantly use UTF-8
Source: Web Technology Surveys

Hunting for examples, I went to the alexa.com page for “Top Sites in China”, then on my Firefox browser I clicked Tools > Page Info: Encoding. Five of the top ten said their encoding was UTF-8, but the other five said their encoding was GBK. This is not a reliable test because the site owner could be fibbing. But even if only a few are telling the truth, it’s clear there are many holdouts in the Non-UTF-8 crannies.

Of course for all of these arguments there are counter-arguments or even possible refutations. But they wouldn’t matter to the DBMS vendor. The vendor is supposed to help the customer with the customer’s plan, rather than explain that they should do something which, with suspicious convenience, would make the vendor’s life easier.

I had a look at the open-source SQL DBMSs, and as far as I could tell only five support both storage and input with a variety of encodings: MySQL and MariaDB of course, but also PostgreSQL, Ingres, and Firebird.
VoltDB definitely cannot. The Java-based ones apparently cannot, but I didn’t make certain.

pgulutzan, October 29, 2013. Category: MySQL / MariaDB.

Regular (RLIKE and REGEXP) expressions: Good News

Ten years ago, MySQL got a “bug report” about trouble with RLIKE. It got marked “won’t fix” because MySQL used a regular-expression library that couldn’t handle non-ASCII characters reliably. Over time there were multiple similar or duplicate bug reports but the only result was a note in the MySQL manual saying, in effect, “tough luck”. Until now.

Actually the fix is in a pre-release of the bzr branch for MariaDB 10.0.5, and I can’t predict that the MySQL folks will copy it, but it looks good to me, and I clap for the original coders who made the “Perl Compatible Regular Expressions” library, the “Google Summer of Code” folks who pay students to help with open source projects, the student coder in question — Sudheera Palihakkara from a Sri Lanka university — and finally Alexander Barkov from the MariaDB foundation.

So what were the problems and what’s been solved — or not solved?

Solved: Handling any characters

Most of the complaints about REGEXP have been about handling of accented Latin characters, Cyrillic characters, or, in short: regardless of character set or collation, if any character in the pattern was outside the ASCII range or was NUL, then there would be false hits, no hits, miscalculated lengths, or failed case conversions. I took examples from the “how to repeat” section of those bug reports, and tried them with MariaDB 10.0.5. Results:

/* Bug#34473 */
MariaDB [test]> SELECT 'Ørneborgen' RLIKE '^[Ö]';
+-----------------------------+
| 'Ørneborgen' RLIKE '^[Ö]'   |
+-----------------------------+
|                           0 |
+-----------------------------+
1 row in set (0.03 sec)

/* Bug#54576 */
MariaDB [test]> SELECT 'č' REGEXP '^.$';
+-------------------+
| 'č' REGEXP '^.$'  |
+-------------------+
|                 1 |
+-------------------+
1 row in set (0.00 sec)

/* Bug#52080 */
MariaDB [test]> SELECT 'Я' regexp 'я';
+------------------+
| 'Я' regexp 'я'   |
+------------------+
|                1 |
+------------------+
1 row in set (0.02 sec)

/* Bug#63439 */
MariaDB [test]> SELECT 'ääliö' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$';
+--------------------------------------------------+
| 'ääliö' REGEXP '^[aä]{1}[aä]{1}li[oö]{1}$'       |
+--------------------------------------------------+
|                                                1 |
+--------------------------------------------------+
1 row in set (0.03 sec)

/* Bug#70470 */
MariaDB [test]> SELECT 'abc\0def' REGEXP 'def';
+-------------------------+
| 'abc\0def' REGEXP 'def' |
+-------------------------+
|                       1 |
+-------------------------+
1 row in set (0.00 sec)

All the above answers are correct, for the first time ever.

In fact one could say that the new REGEXP handles too many characters. It’s adjusted for Unicode 6.2, a version that neither MySQL nor MariaDB support yet. This will of course be of concern to those who use the Meroitic alphabet, but they’re a fading minority.

Not solved: incompatibilities

I guess it was inevitable that, since there’s a different underlying library, the text of some error messages would change. For example I get this:

MariaDB [test]> select '1' rlike rpad('1',100,'(');
ERROR 1139 (42000): Got error 'missing ) at offset 100' from regexp

and that’s okay, but in MySQL 5.6.14 the error message is:
“ERROR 1139 (42000): Got error ‘parentheses not balanced’ from regexp”.

A bit more interesting is this query:

/* Bug#68153 */
MariaDB [test]> SELECT CHAR(126) REGEXP '[[.tilde.]]';
ERROR 1139 (42000): Got error 'POSIX collating elements
 are not supported at offset 1' from regexp

and that looks a little less okay. In MySQL 5.6.14 the same search condition doesn’t cause an error, it’s considered to be “true”.

Solved: Handling more Posix

The MariaDB folks have proudly announced that version 10.0.5 supports “recursive patterns, named capture, look-ahead and look-behind assertions, non-capturing groups, non-greedy quantifiers”. To which I’m sure that the response will be: What? How shocking that non-greedy quantifiers weren’t added ten years ago!

Well, maybe not. But when looking at new features, it’s wrong to ask merely whether there’s some advantage for the non-greedy man on the Clapham omnibus. The point is it’s supposed to be more compatible with modern Perl programs, with Posix, and with all the box-tickers who will ask: do you really support regular expressions? And if someday I reach that pinnacle of ambition, to figure out grep, then I’ll doubtless appreciate look-behind assertions in the way they deserve to be appreciated.

Unsolved: handling more standard SQL

In the operetta Die Fledermaus there’s a refrain: “Happy is the one who forgets what isn’t going to be changed”. But I, alas, can’t forget the comment that I wrote on Bug#746 ten years ago, showing there are at least six behavioural differences between REGEXP and the optional-SQL-standard operator, SIMILAR TO. The incompatibilities arise because SIMILAR TO works like LIKE, so it’s more SQL-ish, and less Posix-ish. Thus, improvements in REGEXP alone are not an advance towards more standard SQL. This would matter more if there were many other SQL implementations supporting SIMILAR TO. The only ones I know of are PostgreSQL and Firebird.

pgulutzan, October 22, 2013. Category: MySQL / MariaDB.

← Previous page