Mroonga and me and MariaDB

Chinese and Japanese and Korean (CJK) text usually has no spaces between words. Conventional full-text search does its tokenizing by looking for spaces. Therefore conventional full-text search will fail for CJK.

One workaround is bigrams. Suppose the text is

册免从冘

There should be three index keys, one for each two-character sequence:

册免, 免从, and 从冘.

Now, in a search like

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('免从');

a bigram-supporting full-text index will have a chance. It’s wasteful and there will be false hits whenever the bigram isn’t really a “word”, but the folks in CJK-land have found that bigrams (or the three-character counterpart, trigrams) actually work.

One way to get bigrams for MySQL or MariaDB is to get mroonga.

Why care about Yet Another Storage Engine)?

Back in 2008 a project named Senna attracted the attention of my colleagues at MySQL, but didn’t go on to world conquest.

Since around 2011 the groonga project, billed as a “successor” to Senna, has been outing regular releases of a generic library that can connect to more than one DBMS. “Mroonga” is the “M[ySQL storage engine that interfaces with the] groonga [code]”. So, although it hasn’t been packaged with MySQL or MariaDB until now, it’s not all new or untested code. What’s new is that MariaDB will, real soon now, include mroonga as part of the regular MariaDB download. Poof, credibility at last.

I understand that mroonga has features that make it interesting:
* It’s a column-storage engine, which I suppose makes it an alternative to Infobright or InfiniDB
* Some of its users or developers also have an involvement with the Spider storage engine, which I suppose means there wouldn’t be too much trouble using the two engines in concert: mroonga for full-text and Spider for sharding.
… but I didn’t look at the column storage or the Spider compatibility. I cared only that SHOW ENGINES said it’s “CJK-ready fulltext search”.

Of course, SHOW ENGINES can also show MyISAM and InnoDB and SphinxSE, and they can all do full-text searching too. One might choose MyISAM because for a long time that was the only engine that had full-text, or choose InnoDB because it’s the default nowadays, or choose SphinxSE because it has lots of features including stemming. But those aren’t targeted for the CJK niche. For example, MariaDB has no bigrams. Sphinx’s documentation says it does have bigrams, but a glance shows instantly “those are not the bigrams you’re looking for”.

The only questions, then, are (a) does mroonga really handle full-text, and (b) does mroonga really handle CJK?

Does mroonga work?

At the time I’m writing this, MariaDB 10.0.8 doesn’t have mroonga yet. There are instructions for getting packages with Windows and Ubuntu etc., but such things never work for me — they’ll always be out of synch with something else that I’ve got, or with what the main vendor (Oracle or MariaDB in this case) is updating. So I decided to build from source. Getting mroonga was easy since I already have Bazaar:

bzr branch lp:~mroonga/maria/mroonga

and then build. Actually the initial build failed because mroonga required CMake 2.8.8 which doesn’t come with Ubuntu 12.04. But after I worked around that, I had a server and all I had to say was

                                                                             
INSTALL PLUGIN mroonga SONAME 'ha_mroonga.so';
CREATE FUNCTION last_insert_grn_id RETURNS INTEGER SONAME 'ha_mroonga.so';
SET GLOBAL mroonga_log_level=NONE;

I found more out by looking at some documentation, which is nowhere near Oracle quality but is mostly in readable English.

Then I was able to do this:

CREATE TABLE t (
  id INT NOT NULL DEFAULT '0',
  `text_column` VARCHAR(100) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (id),
  FULLTEXT KEY `text_column` (text_column)
) ENGINE=mroonga DEFAULT CHARSET=latin1 COMMENT='engine "innodb"'

It’s probably less than obvious that I’m looking at a good-looking feature. The point is: with this syntax I can use mroonga with an InnoDB table. Whoopie. Although that means I lose mroonga’s other features, I can use it without losing whatever I find good about InnoDB. And then I was able to do this:

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('x');

Again it’s probably less than obvious that I’m looking at a good-looking feature. The MATCH … AGAINST syntax is idiosyncratic and unobvious — don’t expect this to arrive in an SQL standard near you any time soon. But it’s what the long-term MyISAM user is used to, so the transition isn’t painful.

Does mroonga handle CJK?

The J (Japanese) part is certainly there. The developers are Japanese. One of the optional extras is integration with MeCab which is a tool that can handle Japanese morphology — it’s like solving the “no spaces” problem by understanding a bit about the Japanese language, which after all is the solution that a human would use with no-spaces text. I don’t see, though, that there’s been equivalent attention paid for the C (Chinese) and K (Korean) parts of CJK. So I just looked at the bigrams, since they’re all that C or K could benefit from.

I created a million-row table containing randomly-chosen characters, mostly Kanji, but with a mix of Latin and Kana letters. How I made it is tedious, but perhaps somebody out there will want to know how to make randomly populated tables with such “data”, so I’ll put the code in a comment on this posting.

I had a choice between a bewildering variety of ways to tokenize. The default one, TokenBigram, did the job. The good news for me was that it didn’t do bigrams when the text was entirely Latin — that would be silly — but in a separate test I did see that it indexed Latin characters as ordinary words. So one of the worries that I had — that mroonga would be only good for CJK but not for non-CJK mixed inside the same text column — turned out not to be a worry.

And now, the essential point, the “l’essence du bigram” (which would look good on a restaurant menu, eh?) …

MariaDB [test]> SELECT COUNT(*) FROM t WHERE MATCH(text_column) AGAINST ('免从');
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

It found a row with a bigram! But is that the right count?

MariaDB [test]> SELECT COUNT(*) FROM t WHERE text_column LIKE '%免从%';
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (2.48 sec)

Yes, it’s the right count — a LIKE search confirms it.

The difference is that the LIKE search took 2.5 seconds because it had to scan a million rows. The mroonga full-text search took approximately zero seconds because it used an index. That doesn’t mean that mroonga is fast — not every search is a benchmark. I did find a graph that shows mroonga throughput is better than MyISAM’s or SphinxSE’s but probably if I looked hard I could find a graph that says the opposite.

And the crowd goes wild

Although I have done no benchmarks or bug hunts, I have acquired the impression that mroonga is capable of doing what its makers say it ought to be able to do.

pgulutzan, January 1, 2014. Category: MariaDB.

About pgulutzan

Co-author of four computer books. Software Architect at MySQL/Sun/Oracle from 2003-2011, and at HP for a little while after that. Currently with Ocelot Computer Services Inc. in Edmonton Canada.

View all posts by pgulutzan

Roles Review →

← The USA’s healthcare.gov site and LAMP

1 Comment

pgulutzan says:

January 9, 2014 at 10:56 am

As mentioned in the article, this is a way to make a million random rows.

USE test

/* The following makes a table with a million rows.
   The second column in each row contains random characters:
   spaces, Japanese kana (full-width katakana or hiragana),
   kanji. They're random. They're only for testing certain
   functionality of certain full-text storage engines. */
DELIMITER //
/* one_char(): Return: an integer = Unicode code point.
   Ranges of Unicode code points are:
   Space:                                            32
   Latin range (upper):                              65 to    90
   Kana (full-width) 3040 to 309f + 30a0 to 30ff  12352 to 12543
   Kanji range: 4e00 to 9faf                      19968 to 40879
   We'll use space, 10 Latin, 100 kana, 1000 kanji.
   Generate random values so that relative distribution of characters is:
   Space: 1%
   Latin: 10%
   Kana: 30%
   Kanji: 59% */

DROP TABLE IF EXISTS t//
CREATE TABLE t (id INT,
                text_column VARCHAR(100) CHARACTER SET utf8,
                PRIMARY KEY (id),
                FULLTEXT KEY text_column (text_column)
                )  ENGINE = mroonga COMMENT = 'engine "innodb"'//
DELIMITER ;
DELIMITER //
DROP FUNCTION IF EXISTS one_char//
CREATE FUNCTION one_char () RETURNS INT
BEGIN
  DECLARE r1,r2 INT;
  SET r2 = RAND()*1000000;
  SET r1 = r2 % 100;
  IF r1 <= 2 THEN RETURN 32; END IF;
  IF r1 <=20 THEN RETURN ((RAND()*1000000)%10)+65; END IF;
  IF r1 <=50 THEN RETURN ((RAND()*1000000)%100)+12352; END IF;
  RETURN ((RAND()*1000000%1000))+19968;
  END//
select one_char(),one_char(),one_char(),one_char(),one_char()//
DELIMITER ;
/* one_million_rows(): produce a table with 1,000,000 rows
   The first column is id: an integer for identification.
   The second column is c: a varchar(100) with random text.
   The random text comes from one_char(). */
DELIMITER //
DROP PROCEDURE IF EXISTS one_million_rows//
CREATE PROCEDURE one_million_rows ()
BEGIN
  DECLARE i, j INT;
  DECLARE c VARCHAR(100) CHARACTER SET ucs2;
  SET i = 1;
  WHILE i < 100 DO
    SET j = 1;
    SET c = '';
    WHILE j < 100 do
      SET c = CONCAT(c,CHAR(one_char() USING UCS2));
      SET j = j + 1;
      END WHILE;
    INSERT INTO t VALUES (i,c);
    SET i = i + 1;
    END WHILE;
  END//
DELIMITER ;
CALL one_million_rows();