MySQL, MariaDB, International Components for Unicode

In an earlier blog post I wrote "MySQL has far better support for character sets and collations than any other open-source DBMS, except sometimes MariaDB."

That's no longer always true, because ICU.

ICU -- International Components for Unicode -- was a Sun + IBM initiative that started over 20 years ago, and has become a major component of major products. The key advantage is that it provides a lax-licensed library that does all the work that's needed for the Unicode Collation Algorithm and the CLDR. No competing product does that.

When I was with MySQL we considered using ICU. We decided "no". We had good reasons then: it didn't do anything new for the major languages that we already handled well, it seemed to change frequently, we preferred to listen to our user base, there wasn't a big list of appropriate rules in a "Common Locale Data Repository" (CLDR) in those days, we expected it to be slow, we worried about the license, and it was quite large. But since then the world has moved on.

The support for ICU among DBMSs

ICU is an essential or an optional part of many products (they're listed on the ICU page in section "who uses"). So there's no problem finding it in Lucene, PHP 6, or a major Linux distro. But our main concern is DBMSs.

DB2          total support, IBM is an ICU evangelist
Firebird     total support
SQLite       optional support (you have to download and recompile yourself)
CouchDB      you're supposed to download ICU but
             you seem to have choices
Sybase IQ    for sortkeys and for Thai    

PostgreSQL catches up?

MySQL/MariaDB have their own code for collations while PostgreSQL depends on the operating system's libraries (libc etc.) to do all its collating with strcoll(), strxfrm(), and equivalents. PostgreSQL is inferior for these reasons:

(1) when the operating system is upgraded your indexes might become corrupt because now the keys aren't where they're supposed to be according to the OS's new rules, and you won't be warned. For a typical horror story see here.

(2) libc had problems and it still does, for example see the bug report "strxfrm results do not match strcoll".

(3) libc is less sophisticated than ICU, for example towupper() looks at only one character at a time (sometimes capitalization should be affected by prior or following characters).

(4) ORDER BY worked differently on Windows than on Linux.

(5) Mac OS X in particular, and sometimes BSD, caused surprise when people found they lacked what libc had in Linux. Sample remarks: "you will have to realize that collations will not work on any BSD-ish OS (incl. OSX) for an UTF8 encoding.", and "It works fine if you use the English language, or if you don't use utf-8."
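To make point (3) concrete, here is a minimal sketch in Python. It is neither libc nor ICU: Python's built-in str.upper() and str.lower() stand in for a full Unicode case mapping, and the helpers per_char_upper() and per_char_lower() are hypothetical names that imitate a one-character-in, one-character-out routine such as towupper().

```python
def per_char_upper(text):
    """Imitate a one-character-in, one-character-out mapping like towupper().

    Assumption for illustration: when the full uppercase form of a character
    is longer than one character, the character is left unchanged.
    """
    out = []
    for ch in text:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def per_char_lower(text):
    """Lowercase each character in isolation, ignoring its neighbours."""
    return "".join(ch.lower() for ch in text)


# One-to-many mapping: German sharp s (U+00DF) uppercases to "SS".
word = "stra\u00dfe"
full_upper = word.upper()            # "STRASSE"
simple_upper = per_char_upper(word)  # sharp s survives unchanged

# Context-sensitive mapping: Greek capital sigma (U+03A3) lowercases to the
# final form U+03C2 at the end of a word, and to U+03C3 elsewhere.
greek = "\u039f\u0394\u039f\u03a3"
full_lower = greek.lower()            # ends with U+03C2
simple_lower = per_char_lower(greek)  # ends with U+03C3

print(full_upper == simple_upper, full_lower == simple_lower)
```

Both comparisons print False: the routine that sees one character at a time cannot expand ß, and it cannot know that the sigma is at the end of a word.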

I've observed before that sometimes MySQL is more standards-compliant than PostgreSQL and this PostgreSQL behaviour is consistent with that observation. Although some people added or suggested ICU patches -- EnterpriseDB and Postgresapp come to mind -- those were improvements that didn't become part of the main line.

In August 2016 a well-known PostgreSQL developer proposed a patch in a thread titled ICU integration. Many others jumped in with support or with rather intelligent criticisms. In March 2017 the well-known developer posted the dread word "Committed". Hurrahs followed. Sample remark: "Congratulations on getting this done. It's great work, and it'll make a whole class of potential bugs and platform portability warts go away if widely adopted."

This doesn't destroy all of MySQL/MariaDB's advantages in the collation area -- a built-in bespoke routine will probably be faster than a generic one that's bloated with checks for things that will never happen, and PostgreSQL perhaps can't do case insensitive ordering without using upper(), and the ICU approach forces some hard trade-off decisions, as we'll see. But the boast "Only MySQL has consistent per-column collation support for multiple languages and multiple platforms" will lose sting.

The problems

If MySQL and/or MariaDB decided to add ICU to their existing collation support, what problems would they face?

The license changed recently; it is now a "Unicode license". You have to acknowledge the copyright and permission everywhere. It is compatible with the GPL, with some restrictions that shouldn't matter. So whatever license problems existed (I forget what they were) are gone.

The Fedora .tgz file is 15 MB, the Windows zip file is 36 MB. The executables are a bit smaller, but you get the idea -- it takes longer to download and takes more storage space. For SQLite this was frightening because its applications embed the library, but to others this doesn't look like a big deal in the era of multi-gigabyte disk drives. The other consideration is that the library might already be there -- it's optional for many Linux packages (I'd also seen a report that it would be the norm in FreeBSD 11 but I didn't find it in the release notes).

According to ICU's own tests ICU can be faster than glibc. According to EnterpriseDB a sort or an index-build can be twice as fast with ICU as without it. I'd be surprised if it ever beats MySQL/MariaDB's built-in code, but that's not a factor -- the built-in collations would stay. These tests just establish that the new improved ones would be at least passable.

One of the PostgreSQL folks worried about ICU because the results coming from the DBMS might not match what the results would be if they used strcoll() in their C programs and ls in their directory searches. But I've never heard of anyone having a problem with this in MySQL, which has never used the same algorithms as strcoll().

If an open-source application comes via a distro, it might have to accept the ICU version that comes with the distro. That's caused problems for Firebird and it's caused fear that you can't bundle your own ICU ("RH [Red Hat] and Debian would instantly rip it out and replace it with their packaged ICU anyway" was one comment on the PostgreSQL thread). EnterpriseDB did bundle, but they had to, because RHEL 6 had a uselessly old (4.2) ICU version on it. Ordinarily this means that the DBMS vendor does not have total control over what ICU version it will use.

If you can't bundle a specific version of ICU and freeze it, you have to worry: what if the collation rules change? I mentioned before how this frightened us MySQL oldies. For example, in early versions of the Unicode Collation Algorithm (what ICU implements), the Polish L-with-slash moved (ah, sweet memories of bygone bogus bug reports), and Upper(German Sharp S) changed (previously ß had no upper case). Such changes would have caused disasters if we'd used ICU in those days: indexes would have keys in the wrong order, CHECK clauses (if we'd had them) would have variable meaning, and some rows could be in different partitions.
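Here is a small Python sketch of why "keys in the wrong order" is a disaster and not a cosmetic problem. Everything in it is made up for illustration: the two alphabets are hypothetical rule sets (not the real UCA history of ł), make_key() and lookup() are hypothetical helpers, and a sorted list plays the part of an index.

```python
from bisect import bisect_left


def make_key(order):
    """Build a sort-key function from an alphabet string (a toy rule set)."""
    rank = {ch: i for i, ch in enumerate(order)}
    return lambda word: [rank[ch] for ch in word]


# Two invented versions of a collation: in the old one "ł" sorts after "z",
# in the new one it sorts right after "l".
old_key = make_key("abcdefghijklmnopqrstuvwxyz\u0142")
new_key = make_key("abcdefghijkl\u0142mnopqrstuvwxyz")

words = ["lato", "\u0142apa", "mama", "zima"]

# The "index" is built once, under the old rules, and never rebuilt.
index = sorted(words, key=old_key)


def lookup(index, word, key):
    """Binary search that trusts the index to be ordered according to key."""
    keys = [key(w) for w in index]
    pos = bisect_left(keys, key(word))
    return pos < len(index) and index[pos] == word


found_old = lookup(index, "\u0142apa", old_key)  # rules match the index
found_new = lookup(index, "\u0142apa", new_key)  # rules changed underneath

print(found_old, found_new)
```

This prints True False. The row is still in the index, but a search that follows the new rules looks in the wrong place and reports that it isn't there, with no warning.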

But it's been years since a movement of a modern letter happened in a major European language. Look at the "Migration issues" that are described in the Unicode Collation Algorithm document:

UCA 6.1.0 2012-02-01 -- added the ignoreSP option
                        added an option for parametric tailoring
UCA 6.3.0 2013-08-13 -- removed the ignoreSP option
                        changed weight of U+FFFD
                        removed fourth-level weights
UCA 7.0.0 2014-05-23 -- clarifications of the text description
UCA 8.0.0 2015-06-01 -- removed contractions for Cyrillic accent letters except Й
UCA 9.0.0 2016-05-18 -- added support for Tangut weights

... If none of these match your idea of an issue, you probably don't have an issue. Plus you have a guarantee: "The contents of the DUCET [Default Unicode Collation Element Table which has the root collation for every character] will remain unchanged in any particular version of the UCA." That's wonderful because the DUCET is good for most languages; only the tailorings -- the special-purpose specifications in the CLDR -- seem to see changes with every release. But if you do use tailorings, I guess you would have to say:
* If there's an upgrade and the ICU version number is new, check indexes are in order
* If there's a network, the ICU version should be the same on all nodes, i.e. upgrade everything together
* Don't store weights (the equivalent of what strxfrm produces) as index keys.
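The first rule in that list could be sketched roughly like this in Python. This is an assumption-laden outline, not what any DBMS actually does: the function names, the version strings and the idea of recording a version string beside each index are all hypothetical. Note that the check recomputes sort keys from the original values, in line with the third rule.

```python
def needs_recheck(recorded_version, current_version):
    """True when the collation library version differs from the one that was
    recorded when the index was built (plain strings, e.g. "58.2")."""
    return recorded_version != current_version


def first_out_of_order(index_values, sort_key):
    """Return the position of the first entry that sorts before its
    predecessor under sort_key, or None if the whole index is in order."""
    previous = None
    for pos, value in enumerate(index_values):
        current = sort_key(value)
        if previous is not None and current < previous:
            return pos
        previous = current
    return None


def check_index(index_values, sort_key, recorded_version, current_version):
    """Skip the scan when the version is unchanged; otherwise verify order."""
    if not needs_recheck(recorded_version, current_version):
        return "unchanged"
    bad = first_out_of_order(index_values, sort_key)
    return "ok" if bad is None else "rebuild (first bad entry at %d)" % bad


# An index that was built in code point order, then checked after an
# "upgrade" whose rules (str.casefold here) ignore case.
index_values = ["Zebra", "apple", "mango"]
print(check_index(index_values, str.casefold, "57.1", "57.1"))
print(check_index(index_values, str.casefold, "57.1", "58.2"))
```

The first call prints "unchanged" because the version string is the same; the second prints "rebuild (first bad entry at 1)" because "apple" now sorts before "Zebra".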

Other news: there is no new release of Ocelot's GUI client for MySQL + MariaDB (ocelotgui) this month, but a few program changes have been made for those who download the source from GitHub.

April 11, 2017. Category: MySQL / MariaDB.

About pgulutzan

Co-author of four computer books. Software Architect at MySQL/Sun/Oracle from 2003-2011, and at HP for a little while after that. Currently with Ocelot Computer Services Inc. in Edmonton Canada.
