Chinese and Japanese and Korean (CJK) text usually has no spaces between words. Conventional full-text search does its tokenizing by looking for spaces. Therefore conventional full-text search will fail for CJK.
One workaround is bigrams. Suppose the text is
There should be three index keys, one for each two-character sequence:
册免, 免从, and 从冘.
Now, in a search like
SELECT * FROM t WHERE MATCH(text_column) AGAINST ('免从');
a bigram-supporting full-text index will have a chance. It's wasteful and there will be false hits whenever the bigram isn't really a "word", but the folks in CJK-land have found that bigrams (or the three-character counterpart, trigrams) actually work.
One way to get bigrams for MySQL or MariaDB is to get mroonga.
Back in 2008 a project named Senna attracted the attention of my colleagues at MySQL, but didn't go on to world conquest.
Since around 2011 the groonga project, billed as a "successor" to Senna, has been outing regular releases of a generic library that can connect to more than one DBMS. "Mroonga" is the "M[ySQL storage engine that interfaces with the] groonga [code]". So, although it hasn't been packaged with MySQL or MariaDB until now, it's not all new or untested code. What's new is that MariaDB will, real soon now, include mroonga as part of the regular MariaDB download. Poof, credibility at last.
I understand that mroonga has features that make it interesting:
* It's a column-storage engine, which I suppose makes it an alternative to Infobright or InfiniDB
* Some of its users or developers also have an involvement with the Spider storage engine, which I suppose means there wouldn't be too much trouble using the two engines in concert: mroonga for full-text and Spider for sharding.
... but I didn't look at the column storage or the Spider compatibility. I cared only that SHOW ENGINES said it's "CJK-ready fulltext search".
Of course, SHOW ENGINES can also show MyISAM and InnoDB and SphinxSE, and they can all do full-text searching too. One might choose MyISAM because for a long time that was the only engine that had full-text, or choose InnoDB because it's the default nowadays, or choose SphinxSE because it has lots of features including stemming. But those aren't targeted for the CJK niche. For example, MariaDB has no bigrams. Sphinx's documentation says it does have bigrams, but a glance shows instantly "those are not the bigrams you're looking for".
The only questions, then, are (a) does mroonga really handle full-text, and (b) does mroonga really handle CJK?
At the time I'm writing this, MariaDB 10.0.8 doesn't have mroonga yet. There are instructions for getting packages with Windows and Ubuntu etc., but such things never work for me -- they'll always be out of synch with something else that I've got, or with what the main vendor (Oracle or MariaDB in this case) is updating. So I decided to build from source. Getting mroonga was easy since I already have Bazaar:
bzr branch lp:~mroonga/maria/mroonga
and then build. Actually the initial build failed because mroonga required CMake 2.8.8 which doesn't come with Ubuntu 12.04. But after I worked around that, I had a server and all I had to say was
INSTALL PLUGIN mroonga SONAME 'ha_mroonga.so'; CREATE FUNCTION last_insert_grn_id RETURNS INTEGER SONAME 'ha_mroonga.so'; SET GLOBAL mroonga_log_level=NONE;
I found more out by looking at some documentation, which is nowhere near Oracle quality but is mostly in readable English.
Then I was able to do this:
CREATE TABLE t ( id INT NOT NULL DEFAULT '0', `text_column` VARCHAR(100) CHARACTER SET utf8 DEFAULT NULL, PRIMARY KEY (id), FULLTEXT KEY `text_column` (text_column) ) ENGINE=mroonga DEFAULT CHARSET=latin1 COMMENT='engine "innodb"'
It's probably less than obvious that I'm looking at a good-looking feature. The point is: with this syntax I can use mroonga with an InnoDB table. Whoopie. Although that means I lose mroonga's other features, I can use it without losing whatever I find good about InnoDB. And then I was able to do this:
SELECT * FROM t WHERE MATCH(text_column) AGAINST ('x');
Again it's probably less than obvious that I'm looking at a good-looking feature. The MATCH ... AGAINST syntax is idiosyncratic and unobvious -- don't expect this to arrive in an SQL standard near you any time soon. But it's what the long-term MyISAM user is used to, so the transition isn't painful.
The J (Japanese) part is certainly there. The developers are Japanese. One of the optional extras is integration with MeCab which is a tool that can handle Japanese morphology -- it's like solving the "no spaces" problem by understanding a bit about the Japanese language, which after all is the solution that a human would use with no-spaces text. I don't see, though, that there's been equivalent attention paid for the C (Chinese) and K (Korean) parts of CJK. So I just looked at the bigrams, since they're all that C or K could benefit from.
I created a million-row table containing randomly-chosen characters, mostly Kanji, but with a mix of Latin and Kana letters. How I made it is tedious, but perhaps somebody out there will want to know how to make randomly populated tables with such "data", so I'll put the code in a comment on this posting.
I had a choice between a bewildering variety of ways to tokenize. The default one, TokenBigram, did the job. The good news for me was that it didn't do bigrams when the text was entirely Latin -- that would be silly -- but in a separate test I did see that it indexed Latin characters as ordinary words. So one of the worries that I had -- that mroonga would be only good for CJK but not for non-CJK mixed inside the same text column -- turned out not to be a worry.
And now, the essential point, the "l'essence du bigram" (which would look good on a restaurant menu, eh?) ...
MariaDB [test]> SELECT COUNT(*) FROM t WHERE MATCH(text_column) AGAINST ('免从'); +----------+ | COUNT(*) | +----------+ | 1 | +----------+ 1 row in set (0.00 sec)
It found a row with a bigram! But is that the right count?
MariaDB [test]> SELECT COUNT(*) FROM t WHERE text_column LIKE '%免从%'; +----------+ | COUNT(*) | +----------+ | 1 | +----------+ 1 row in set (2.48 sec)
Yes, it's the right count -- a LIKE search confirms it.
The difference is that the LIKE search took 2.5 seconds because it had to scan a million rows. The mroonga full-text search took approximately zero seconds because it used an index. That doesn't mean that mroonga is fast -- not every search is a benchmark. I did find a graph that shows mroonga throughput is better than MyISAM's or SphinxSE's but probably if I looked hard I could find a graph that says the opposite.
Although I have done no benchmarks or bug hunts, I have acquired the impression that mroonga is capable of doing what its makers say it ought to be able to do.
As mentioned in the article, this is a way to make a million random rows.