The UTF-8 world is not enough

In English there are very few words of Japanese origin but I think this one has a great future: mojibake. Mojibake is the garbage you see when MySQL or MariaDB has a column definition saying character set A, stores into it a string that’s actually in character set B, then ships it to a client which expects everything to be in character set C.
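
Here is a hedged illustration of the effect, assuming a utf8mb4 client connection; the CAST-through-binary trick simply re-labels the UTF-8 bytes of 'café' as latin1, the way a mismatched column or client would:

SELECT CONVERT(CAST('café' AS BINARY) USING latin1);
        /* displays as cafÃ© -- the two UTF-8 bytes of é (C3 A9)
           come back as two separate latin1 characters */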

For some DBMSs (Drizzle and NuoDB spring to mind) it’s apparent that the developers decided that users could avoid mojibake, and developers could avoid a lot of extra trouble, if everybody had the same character set: UTF-8. Well, MySQL and MariaDB have web users and UTF-8 is popular on the web. According to Web Technology Surveys, 80% of web sites use UTF-8. There’s a Google study that came to a vaguely similar conclusion.

And yet, and yet, there are devastating arguments for why MySQL or MariaDB should continue to support character sets other than UTF-8, and even add one or two more.

Space

With a Greek character set like 8859-7, it takes one (1) byte to store one ordinary Greek character. With UTF-8 it takes two (2) bytes to store one ordinary Greek character. So 8859-7 is 50% cheaper for Greek words, and the same ratio holds for other 8-bit non-Latin language-specific character sets, such as the Cyrillic ones used for Bulgarian.

With a multi-byte character set like euckr (for Korean) things get more complicated and results will vary. I took the lead paragraph for the Korean Wikipedia article on Seoul, dumped the contents into columns with different character set definitions, and compared the sizes with the octet_length() function:

utf8     1260              /* worst */
utf16    1138              /* middle */
euckr     916              /* best */

Unsurprisingly, it takes fewer bytes to store Korean text with a “made-in-Korea” character set. It’s also a bit cheaper to use utf16 than utf8, since utf16 always needs two bytes per character for ordinary Korean text while utf8 needs three, which is doubtless one reason the SQL Server folks prefer UTF-16 over UTF-8. I will of course ignore any objections that space doesn’t matter, or that compression algorithms exist.
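
A minimal sketch of that comparison, assuming a server where all three character sets are available (the table and column names are mine, not from the original test, and the Korean text is a placeholder):

CREATE TABLE t_kr (
  s_utf8  TEXT CHARACTER SET utf8,
  s_utf16 TEXT CHARACTER SET utf16,
  s_euckr TEXT CHARACTER SET euckr
);
INSERT INTO t_kr (s_utf8) VALUES ('...lead paragraph of the Korean Wikipedia article on Seoul...');
UPDATE t_kr SET s_utf16 = s_utf8, s_euckr = s_utf8;    /* the server converts on assignment */
SELECT OCTET_LENGTH(s_utf8), OCTET_LENGTH(s_utf16), OCTET_LENGTH(s_euckr) FROM t_kr;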

Conversion trouble

It is true that Unicode’s repertoire is a superset of what’s in Japanese character sets like sjis or ujis. But that does not mean you can start with sjis text, convert it to Unicode, and then convert it back to sjis. On rare occasions such a “round trip” will fail. This happens because the mapping is not always one-to-one. It also happens because there are slight differences between various versions of sjis, a fact which causes no difficulty if there’s no conversion, but causes some difficulty if text comes in from the client as sjis, gets converted to Unicode for storage, then gets converted back to sjis when the server returns it to the client.
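
A hedged sketch of such a round trip, assuming a server with sjis available; 0x8160 is the sjis “wave dash”, a commonly cited troublemaker whose fate depends on which sjis-to-Unicode mapping table is in use:

SELECT HEX(CONVERT(CONVERT(_sjis 0x8160 USING utf8mb4) USING sjis)) AS round_trip;
        /* if the mapping is not one-to-one, round_trip may not come back as 8160 */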

Legacy

At the start of this post I mentioned that 80% of web sites use UTF-8. However, that is the world-wide figure. For a language-specific look, here are two sobering statistics:

* Only 54% of Chinese-language web sites predominantly use UTF-8
* Only 60% of Japanese-language web sites predominantly use UTF-8
Source: Web Technology Surveys

Hunting for examples, I went to the alexa.com page for “Top Sites in China”, then in my Firefox browser I clicked Tools > Page Info: Encoding. Five of the top ten said their encoding was UTF-8; the other five said GBK. This is not a reliable test, because the site owner could be fibbing. But even if only a few are telling the truth, it’s clear there are many holdouts in the non-UTF-8 crannies.

Of course for all of these arguments there are counter-arguments or even possible refutations. But they wouldn’t matter to the DBMS vendor. The vendor is supposed to help the customer with the customer’s plan, rather than explain that the customer should do something which, with suspicious convenience, would make the vendor’s life easier.

I had a look at the open-source SQL DBMSs and, as far as I could tell, only five support both storage and input with a variety of encodings: MySQL and MariaDB of course, but also PostgreSQL, Ingres, and Firebird. VoltDB definitely cannot. The Java-based ones apparently cannot, but I didn’t make certain.

October 29, 2013. Category: MySQL / MariaDB.

5 Comments

  1. Also, MySQL / MariaDB utf8 is 3-byte UTF-8 only, whereas, say, Oracle’s is true variable length. In MySQL / MariaDB you have to use utf8mb4 to use the full Unicode character set.
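
    A hedged illustration of the 3-byte limit, assuming MySQL/MariaDB 5.5 or later (the table names here are made up):

    CREATE TABLE t3 (s VARCHAR(10) CHARACTER SET utf8);
    CREATE TABLE t4 (s VARCHAR(10) CHARACTER SET utf8mb4);
    INSERT INTO t4 VALUES (_utf8mb4 0xF09F9880);   /* U+1F600, a 4-byte character: OK */
    INSERT INTO t3 VALUES (_utf8mb4 0xF09F9880);   /* error or data loss, depending on sql_mode */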

  2. Justin Swanhart says:

    Given that fixed widths are used in many places, shorter character sets can save a lot. Proper names and blob text (blog posts, comments, etc.) need multi-byte encoding. Almost anything else can either use single-byte encoding or use a lookup table.
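
    A hedged illustration of the fixed-width point, assuming the MEMORY engine (which stores CHAR columns at their maximum byte width):

    CREATE TABLE t_narrow (c CHAR(10) CHARACTER SET latin1)  ENGINE=MEMORY;
    CREATE TABLE t_wide   (c CHAR(10) CHARACTER SET utf8mb4) ENGINE=MEMORY;
    /* t_narrow reserves 10 bytes per row for c; t_wide reserves 40 */
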

  3. I believe that the compatibility of character sets causes many of problemens. ISO-8859-1, ISO-8859-15, Windows-1252, Latin1, ASCII and ANSI are just to similar. It’s unpossible to ‘see’ how a string was encoded when only ‘simple ascii’ characters are used. Interesting enough latin1 is a official alias for iso-8859-1 according to wikipedia, but “show character set like ‘latin1’;” says it’s cp1252, but for € the values don’t seem to be matching the values on wikipedia.