Category: MariaDB


A glance at a MariaDB release candidate

Today I looked at the MariaDB Release Candidate wondering how my earlier predictions came out.

I predicted, for “roles”:

For all of the bugs, and for some of the flaws, there’s no worry — they’ll probably disappear.

In fact all the bugs are gone, and I belatedly realized (after some gentle nudges from a MariaDB employee) that some of the flaws weren’t flaws.
The inability to grant to PUBLIC still troubles me, but it looks like roles are ready to roll.

I predicted for “mroonga”:

At the time I’m writing this, MariaDB 10.0.8 doesn’t have mroonga yet.

In other words, I thought it would be in 10.0.8. It was not to be, as another MariaDB employee told me:

Unfortunately, mroonga appeared to have some portability problems. Most likely it won’t be in 10.0.8. There is a good chance it will be 10.0.9 though.

While I was looking, I tried out a new clause for the DELETE statement:
DELETE FROM t RETURNING select-list;
which returns the rows that just got deleted. It’s a new non-standard clause which increases MariaDB’s divergence from MySQL, so decent people should demand an explanation.

Oracle 12c looks similar but does something different:
DELETE FROM t RETURN|RETURNING expr [,expr...] INTO data-item [,data-item...];
SQL Server 2012 has a clause which looks different but does something similar:
DELETE FROM t OUTPUT select-list;
PostgreSQL has a clause which looks similar and does something similar:
DELETE FROM t RETURNING select-list;

So we’re seeing, yet again, the belief that PostgreSQL is a better model for MariaDB than the big boys are. By the way, the big boys are a bit more orthogonal — I’ve seen extensions similar to this one in UPDATE or MERGE statements. But the bigger orthogonality woe is that here we have a statement that returns a result set, but can’t be used in all the places where a result-set-returning sub-statement could work, like DECLARE CURSOR FOR DELETE, or INSERT … DELETE. Of course, I’m glad they don’t work. My point is only that now-it-works-now-it-doesn’t additions make the product look haphazard.

Update: How it’s done in standard SQL

This section was added on 2014-03-01, one week after the original post.
SQL:2011 has a non-core (optional) feature T495 “Combined data change and retrieval”. Ignoring a lot, the syntax is SELECT … FROM NEW|OLD|FINAL TABLE (INSERT|UPDATE|DELETE …) …. For example

 SELECT column1+5 FROM OLD TABLE (DELETE FROM t) AS t WHERE f = 77;

In this example table t is a “delta table”.
There’s an example with DECLARE CURSOR and FETCH in the DB2 10 for z/OS manual.

Mroonga and me and MariaDB

Chinese and Japanese and Korean (CJK) text usually has no spaces between words. Conventional full-text search does its tokenizing by looking for spaces. Therefore conventional full-text search will fail for CJK.

One workaround is bigrams. Suppose the text is

册免从冘

There should be three index keys, one for each two-character sequence:

册免, 免从, and 从冘.

Now, in a search like

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('免从');

a bigram-supporting full-text index will have a chance. It’s wasteful and there will be false hits whenever the bigram isn’t really a “word”, but the folks in CJK-land have found that bigrams (or the three-character counterpart, trigrams) actually work.

One way to get bigrams for MySQL or MariaDB is to get mroonga.

Why care about Yet Another Storage Engine)?

Back in 2008 a project named Senna attracted the attention of my colleagues at MySQL, but didn’t go on to world conquest.

Since around 2011 the groonga project, billed as a “successor” to Senna, has been outing regular releases of a generic library that can connect to more than one DBMS. “Mroonga” is the “M[ySQL storage engine that interfaces with the] groonga [code]”. So, although it hasn’t been packaged with MySQL or MariaDB until now, it’s not all new or untested code. What’s new is that MariaDB will, real soon now, include mroonga as part of the regular MariaDB download. Poof, credibility at last.

I understand that mroonga has features that make it interesting:
* It’s a column-storage engine, which I suppose makes it an alternative to Infobright or InfiniDB
* Some of its users or developers also have an involvement with the Spider storage engine, which I suppose means there wouldn’t be too much trouble using the two engines in concert: mroonga for full-text and Spider for sharding.
… but I didn’t look at the column storage or the Spider compatibility. I cared only that SHOW ENGINES said it’s “CJK-ready fulltext search”.

Of course, SHOW ENGINES can also show MyISAM and InnoDB and SphinxSE, and they can all do full-text searching too. One might choose MyISAM because for a long time that was the only engine that had full-text, or choose InnoDB because it’s the default nowadays, or choose SphinxSE because it has lots of features including stemming. But those aren’t targeted for the CJK niche. For example, MariaDB has no bigrams. Sphinx’s documentation says it does have bigrams, but a glance shows instantly “those are not the bigrams you’re looking for”.

The only questions, then, are (a) does mroonga really handle full-text, and (b) does mroonga really handle CJK?

Does mroonga work?

At the time I’m writing this, MariaDB 10.0.8 doesn’t have mroonga yet. There are instructions for getting packages with Windows and Ubuntu etc., but such things never work for me — they’ll always be out of synch with something else that I’ve got, or with what the main vendor (Oracle or MariaDB in this case) is updating. So I decided to build from source. Getting mroonga was easy since I already have Bazaar:

bzr branch lp:~mroonga/maria/mroonga

and then build. Actually the initial build failed because mroonga required CMake 2.8.8 which doesn’t come with Ubuntu 12.04. But after I worked around that, I had a server and all I had to say was

                                                                             
INSTALL PLUGIN mroonga SONAME 'ha_mroonga.so';
CREATE FUNCTION last_insert_grn_id RETURNS INTEGER SONAME 'ha_mroonga.so';
SET GLOBAL mroonga_log_level=NONE;

I found more out by looking at some documentation, which is nowhere near Oracle quality but is mostly in readable English.

Then I was able to do this:

CREATE TABLE t (
  id INT NOT NULL DEFAULT '0',
  `text_column` VARCHAR(100) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (id),
  FULLTEXT KEY `text_column` (text_column)
) ENGINE=mroonga DEFAULT CHARSET=latin1 COMMENT='engine "innodb"'

It’s probably less than obvious that I’m looking at a good-looking feature. The point is: with this syntax I can use mroonga with an InnoDB table. Whoopie. Although that means I lose mroonga’s other features, I can use it without losing whatever I find good about InnoDB. And then I was able to do this:

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('x');

Again it’s probably less than obvious that I’m looking at a good-looking feature. The MATCH … AGAINST syntax is idiosyncratic and unobvious — don’t expect this to arrive in an SQL standard near you any time soon. But it’s what the long-term MyISAM user is used to, so the transition isn’t painful.

Does mroonga handle CJK?

The J (Japanese) part is certainly there. The developers are Japanese. One of the optional extras is integration with MeCab which is a tool that can handle Japanese morphology — it’s like solving the “no spaces” problem by understanding a bit about the Japanese language, which after all is the solution that a human would use with no-spaces text. I don’t see, though, that there’s been equivalent attention paid for the C (Chinese) and K (Korean) parts of CJK. So I just looked at the bigrams, since they’re all that C or K could benefit from.

I created a million-row table containing randomly-chosen characters, mostly Kanji, but with a mix of Latin and Kana letters. How I made it is tedious, but perhaps somebody out there will want to know how to make randomly populated tables with such “data”, so I’ll put the code in a comment on this posting.

I had a choice between a bewildering variety of ways to tokenize. The default one, TokenBigram, did the job. The good news for me was that it didn’t do bigrams when the text was entirely Latin — that would be silly — but in a separate test I did see that it indexed Latin characters as ordinary words. So one of the worries that I had — that mroonga would be only good for CJK but not for non-CJK mixed inside the same text column — turned out not to be a worry.

And now, the essential point, the “l’essence du bigram” (which would look good on a restaurant menu, eh?) …

MariaDB [test]> SELECT COUNT(*) FROM t WHERE MATCH(text_column) AGAINST ('免从');
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

It found a row with a bigram! But is that the right count?

MariaDB [test]> SELECT COUNT(*) FROM t WHERE text_column LIKE '%免从%';
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (2.48 sec)

Yes, it’s the right count — a LIKE search confirms it.

The difference is that the LIKE search took 2.5 seconds because it had to scan a million rows. The mroonga full-text search took approximately zero seconds because it used an index. That doesn’t mean that mroonga is fast — not every search is a benchmark. I did find a graph that shows mroonga throughput is better than MyISAM’s or SphinxSE’s but probably if I looked hard I could find a graph that says the opposite.

And the crowd goes wild

Although I have done no benchmarks or bug hunts, I have acquired the impression that mroonga is capable of doing what its makers say it ought to be able to do.

The Third Most Popular Open Source DBMS

We all know that MySQL says it is “the world’s most popular open-source database”. And PostgreSQL has a firm hold on second place while claiming to be, instead, “the world’s most advanced open source database”. But the horse that comes in third can return some money to gamblers who bet “to show”. And the horse that shows momentum or gets close is worth watching for next time.

So I’ll just ignore the dolphin and the elephant in the room, and go on to a harder question: who’s number three?

According to Wikipedia

To find out how many times someone has expressed interest in a topic, I can go to stats.grok.se and ask how many times someone has looked at that topic’s page in Wikipedia. Evil-thinking people could manipulate these numbers with ease, but until now they have had no reason to do so.

So here is the count of accesses on the Wikipedia pages of the nine contending SQL DBMSs, for September last year (blue) and September this year (red).

third_most_popular_1

In absolute terms, MariaDB is only slightly ahead. But if you look at momentum (what direction the hit counts have gone between September 2012 and September 2013), interest in MariaDB has risen a great deal while interest in most other DBMSs has declined.

Now here is the chart for open-source NoSQL DBMSs, made the same way. This time I can’t be comprehensive — there are 150 NoSQL products! I’ll show only the ten with the most hits.
thirrd_most_popular_2
Of course MongoDB is in front. The surprise is that MongoDB’s in-frontness isn’t glaring.

According to The Register

To find out how many times the mass media mentioned a DBMS, I can go to google.com and enter site:theregister.co.uk [DBMS name] -mozilla. Simply googling the DBMS name would fail because Google seems to become more inexact as numbers get bigger, and because “Ingres” and “Firebird” and “Drizzle” and “Cassandra” have alternative meanings that have nothing to do with databases.

So this term is more specific, it’s asking: how many times does search-term appear in The Register (in a non-Mozilla context)? For this purpose I’m positing that “the mass media” and the well-known medium “The Register” are synonymous terms.
third_most_popular_3

According to Ohloh

To find out how much activity an open-source project has, I can go to ohloh.net and look up the count of commits in the last year. (Ohloh “provides a single aggregate source of information about the world’s open source projects from over 5,000 repositories and forges including GitHub, SourceForge and Google Code, as well as open source foundation repositories including the Eclipse, Mozilla, Apache, and Linux.”) This method has been justly criticized, but is an indicator for developer enthusiasm. Again I am only showing the top ten, other than MySQL or PostgreSQL, for any kind of open-source DBMS.
third_most_popular_4

According to db-engines

An Austrian company named solid IT maintains a site named db-engines.com. On it there is “a list of DBMS ranked by their current popularity … updated monthly.” Their ranking is based on (a) Google and Bing hits, (b) Google Trends, (c) Stack Overflow and DBA Stack Exchange (counting the number of technical discussions), (d) indeed.com and simplyhired.com (counting the number of job offers), (e) LinkedIn (counting the number of people who mention the product in their profiles). In other words, none of their criteria match the criteria that I prefer. If one excludes closed-source and MySQL and PostgreSQL from their list, their top 10 hits are:
third_most_popular_5

Woe and alas. If their results had been similar to mine, despite their different criteria, then that might have validated both our methods. But they’re very different. I think that’s because my criteria are the only valid ones, but obviously I’m not the only evaluator.

According to the makers

“We believe that there are more copies of SQLite in use around the world than any other SQL database engine, and possibly all other SQL database engines combined.”
sqlite.org

“Mountain View, Calif. – February 8, 2012 – Couchbase, Inc. [is] the NoSQL market share leader …”
couchbase.com

“Firebird® is the universal open source database used by approximately 1 million of software developers worldwide.”
facebook.com/FirebirdSQL

“MongoDB is the leading NoSQL database, with significant adoption among the Fortune 500 and Global 500.”
mongodb.com

Announcing the finishers

MariaDB is in third place among the open-source SQL DBMSs, and would be third overall if we gave more weight to the numbers “according to Ohloh” and “according to The Register”. But MongoDB and Cassandra nose past it “according to Wikipedia”, and MongoDB is definitely the third-place finisher “according to db-engines”. The claim of SQLite is strongest “according to the makers”.

We have a photo finish, with a blurry camera.

← Previous page