Month: January 2014


MySQL, GROUP BY, Select lists, and Standard SQL

The MySQL manual says: “In standard SQL, a query that includes a GROUP BY clause cannot refer to nonaggregated columns in the select list that are not named in the GROUP BY clause.”

I got questioned about that last week, and this week I see that there’s a proposed change in behaviour for MySQL 5.7, so I guess it’s a good time to try to figure out what’s going on in this part of the world.

I checked the manuals for most current DBMSs and confirmed that the ONLY_FULL_GROUP_BY kind of behaviour is required by DB2, Derby, Firebird, HSQLDB, Informix, Mimer, Oracle, SQL Server, and Teradata. The only exceptions are Sybase and SQLite, and I’ll mention PostgreSQL later.

But returning an error is not really absolutely always the right thing.

Suppose your table was made this way:
CREATE TABLE t (a INT PRIMARY KEY, b INT);
And suppose you make this query:
SELECT a, b FROM t GROUP BY a;
This causes an error if sql_mode = ONLY_FULL_GROUP_BY.
But is it really an error?
The trick here is: if column a is the PRIMARY KEY (or a UNIQUE NOT NULL key), and you’re grouping by a, then it is logically impossible — impossible — that there could be two different values of column b.

One could say that “a determines b”, or that there is a “functional dependency” here. Roland Bouman talked about the matter in 2007 in an article “Debunking GROUP BY myths”. He made the point that, in fact, the current SQL standard does not really absolutely always require that there be an error if there’s a functional dependency.
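To make the "a determines b" point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in DBMS (an assumption made for convenience; it is not MySQL, and the sample rows are invented). It checks that every group produced by GROUP BY a contains exactly one row, and that wrapping b in an aggregate, which is the usual portable workaround, returns the same rows as the plain table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOT NULL is spelled out because SQLite, unlike standard SQL,
# would otherwise accept NULLs in a non-INTEGER primary key.
conn.execute("CREATE TABLE t (a INT NOT NULL PRIMARY KEY, b INT)")
conn.executemany(
    "INSERT INTO t (a, b) VALUES (?, ?)",
    [(1, 10), (2, 20), (3, 20)],  # invented sample rows
)

# Every group formed by GROUP BY a holds exactly one row,
# so there can never be two different b values within a group.
group_sizes = conn.execute(
    "SELECT a, COUNT(*) FROM t GROUP BY a ORDER BY a"
).fetchall()

# Portable rewrite: put b inside an aggregate so it is no longer
# a nonaggregated column that is missing from the GROUP BY clause.
portable = conn.execute(
    "SELECT a, MIN(b) FROM t GROUP BY a ORDER BY a"
).fetchall()
plain = conn.execute("SELECT a, b FROM t ORDER BY a").fetchall()

print(group_sizes)
print(portable == plain)
```

Because a is unique, MIN(b) has only one value to choose from in each group, so the rewrite changes nothing except that a strict DBMS should accept it.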

So let’s look a bit more closely at what the standard says now. Sample:

2) Without Feature T301, “Functional dependencies”, in conforming SQL language, if QCR is a group-invariant column reference, then QCR shall be a reference to a grouping column of the qualifying query of QCR.

Arm-waving our way past the terminology, this kind of thing means that ONLY_FULL_GROUP_BY behaviour is a requirement for any implementation that doesn’t claim to support Feature T301. Feature T301, eh? That’s not a core feature; in other words, no SQL DBMS is required to support T301 in order to claim ANSI/ISO conformance. But it is standard. So one could argue that the MySQL manual is slightly wrong: it shouldn’t suggest that “the standard” requires ONLY_FULL_GROUP_BY. On the other hand, MySQL’s behaviour without ONLY_FULL_GROUP_BY is obviously non-standard, because there’s no checking whether something is “functionally dependent” or not.

What is to be done, then?

Well, it’s rather short notice to say that something will be in 5.7. There were occasional bitter experiences when incompatible behaviour was rushed into MySQL, and I thought the lessons were learned. I was wrong. I should have listened to Mr Kipling:

As it will be in the future, it was at the birth of Man
There are only four things certain since Social Progress began.
That the Dog returns to his Vomit and the Sow returns to her Mire,
And the burnt Fool’s bandaged finger goes wabbling back to the Fire.

Sure, but shouldn’t MySQL do something about that “functional dependency” stuff?

Well, PostgreSQL did. For the sort of item that I used in the example above, which by the way is called a “BPK-dependency” (base table primary key dependency), recent versions of PostgreSQL will say okay. As a result PostgreSQL now claims “partial” compliance with feature T301. But only partial!

The difficulty is that there are so many possibilities, as soon as you start to consider what should happen with views, left or right joins, subsets of multi-column keys, rollup, subqueries, and the many ways that a functional dependency can become known, besides the simple BPK case. Trudy Pelzer and I didn’t really get to grips with these arcana in our book “SQL-99 Complete, Really”, but Jim Melton and Alan Simon did in their book “SQL:1999 Understanding Relational Language Components”, for several dense pages. I’ll just quote the last sentences:

We realize that the material in this section is complex and abstract in the extreme. We have included it in spite of the complexity to serve readers who need to understand SQL:1999’s enhancements in updatability and who are sufficiently involved with data modeling to have or pursue the background in relational theory. However, in practical terms, we have rarely found ourselves using knowledge of all of these functional dependencies when writing our applications.

I think that it will be a long time before PostgreSQL will have full support, and this will cause confusion — people who understand the rationale will expect queries to work whenever they’re rational according to functional-dependency thoughts, and they won’t work. That doesn’t mean, though, that PostgreSQL is wrong — as I’ve said before, complaints about “half-baked features” are typically wrong. I see that the Derby folks are considering following the PostgreSQL lead. And maybe MySQL someday will too. There’s a worklog task, WL#2489, Fully compliant ONLY_FULL_GROUP_BY mode. Once again Oracle has hidden the task, once again it’s visible on wayback, but WL#2489 had no high-level architecture anyway. Well, that means the plan can be flexible.

And the bottom line is: MySQL should change, but no rush. Oracle should recall Saint Augustine’s prayer: “Lord, grant me chastity and continence, but not yet.” A variation could be: “Lord, let me support standard SQL by default with Feature T301 … but not in 5.7.”

Roles Review

A role is a bundle of GRANTed privileges which can be assigned to users or which can take the place of users. When there are hundreds or thousands of users, administration becomes a horrible burden if the DBMS doesn’t support roles. Our online ANSI-standard book has syntax descriptions. Other DBMSs do support roles, and MySQL has had a worklog task “WL#988 Roles” for many years. Earlier attempts to implement them included a Google Summer of Code project and a MySQL tool released by Google.

Now another google-summer-coder, Vicențiu Ciorbaru, has put together something which will be in MariaDB. MariaDB’s official blog says this is “a big thing”. I’ll compare the MySQL specification to the MariaDB feature.

Unfortunately Oracle made WL#988 Roles a hidden task about a year and a half ago — you can’t find it by looking on dev.mysql.com/worklog. Fortunately the Wayback Machine had crawled the page in 2012 — you can find it by looking on http://web.archive.org/web/20120508072213/http://forge.mysql.com/worklog/task.php?id=988. Remember that worklog pages are tabbed, and you must click the “High Level Architecture” button to see most of the specification.

I’m using MariaDB 10.0.8 built from source. This is an early version and changes will happen before General Availability happens. Although MariaDB has its own worklog task which I’ll refer to as MDEV-4397, I prefer WL#988 for obvious reasons. For this review I’ll use the same order and same section names as WL#988 high-level architecture. It will be hard to follow unless you read WL#988 first.

CREATE ROLE

“CREATE ROLE role_1;” works. Good.

“CREATE ROLE IF NOT EXISTS role_2;” does not work. I’m ambivalent about this. I don’t like the IF NOT EXISTS clause myself, but I thought it was the way of the future, according to another MySQL Worklog task, WL#3129 Consistent clauses in CREATE and DROP.

Here’s the first bug: I get a crash if I use a bad name:

MariaDB [test]> CREATE ROLE ``;
ERROR 2013 (HY000): Lost connection to MySQL server during query

Notes about role names

MariaDB allows role names to be wildcards or privilege names, for example

MariaDB [test]> CREATE ROLE super;
Query OK, 0 rows affected (0.00 sec)
MariaDB [d]> create role '%';
Query OK, 0 rows affected (0.00 sec)

I regard that as a flaw. Somewhere, ages hence, somebody will get confused because these names have special meanings.

DROP ROLE

“DROP ROLE role_1;” works. Good.

“DROP ROLE IF EXISTS role_1;” does not work. Once again I’m ambivalent. WL#988 specifies that an IF EXISTS clause should be permissible. On the other hand, “DROP USER IF EXISTS u;” doesn’t work either.

DROP ROLE does not take effect immediately. If a user has enabled role_x, then that user continues to have the privileges of role_x even after role_x is dropped. I regard that as a flaw. [UPDATE: it’s normal, see the comments.]

Here’s the second bug: DROP USER crashes if I specify a role name:

MariaDB [test]> CREATE ROLE role_55;
Query OK, 0 rows affected (0.00 sec)

MariaDB [test]> DROP USER role_55;
ERROR 2013 (HY000): Lost connection to MySQL server during query

GRANT privilege TO role_name

“GRANT SELECT ON t TO role_1;” works. Good.

“GRANT PROXY ON a TO role_1;” works too. I’m ambivalent yet again. As WL#988 says, it’s meaningless to grant a privilege to a role if it only can affect a user. There will always be an argument between people who think this must mean the user is making a mistake, and people who think it doesn’t do any harm.

The privilege change does not take effect immediately. Consider what WL#988 says:

“The privilege change takes effect for the next statement that any affected user executes. (A user is affected if he/she has an enabled role that is affected by the privilege change.)”

That seems reasonable — after all, if I say “GRANT … TO user_x” then user_x gets the privilege immediately. But it doesn’t happen for roles. If I say “GRANT … TO role_x”, then role_x gets the privilege — but users who have enabled role_x do not. I regard this as a flaw. [UPDATE: my test was bad, I was not granting the same type of privilege to both the role and the user, see the comments.]

Here’s the third bug: if a user and a role have the same name, GRANT only works on the role. If I say

CREATE USER u_1;
CREATE ROLE u_1;
GRANT INSERT ON *.* TO u_1;
SELECT host,user,insert_priv,is_role FROM mysql.user WHERE mysql.user.user='u_1';

I’d expect that INSERT would be granted to both the user and the role. It’s not, it’s granted only to the role. Moral: users and roles should not have the same names!

GRANT role_name TO user_name

“GRANT role_1 TO user_1;” works. Good. “GRANT role_1 ON *.* TO user_1;” would be good too but an [ON *.*] clause should never be necessary.

“GRANT role_1 TO user_1 WITH GRANT OPTION;” does not work. I’m ambivalent yet again. What we actually have is an ability to say “GRANT role_1 TO user_1 WITH ADMIN OPTION;” and that’s okay (WITH ADMIN OPTION is a standard clause), but I’m not sure it’s a good idea that it’s the default.

“GRANT role_1, role_2 to user_1;” does not work. I regard this as a flaw, but it might have been tough to implement. (UPDATE: 2016-03-20: It is indeed tough to implement. Two months after this blog comment was made, since the manual said that it would work, this flaw was reported as a bug: MDEV-5772 Granting multiple roles in single statement does not work. But two years later it was still unresolved.)

Who has GRANT role_name privileges?

As WL#988 puts it:

Some People will say that, if Peter said CREATE ROLE Role1, Peter should automatically have GRANT Role1 privilege. The analogy is with the way that we grant for routines. This is what would happen with Oracle.
Other People will say that, no, the only way for Peter to get GRANT Role1 privileges is if somebody grants to Peter. The analogy is with the way that we grant for tables.

The implementer decided to follow the advice of Some People for this one. Good.

GRANT role_name TO role_name

“GRANT role_1 TO role_2;” works. Good.

“GRANT role_1 TO role_1;” does not work — and that’s good too. This is a special instance of what WL#988 calls a “cyclical GRANT”. Nobody would want that.
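For readers who wonder what a check for a “cyclical GRANT” involves, here is a minimal Python sketch. It is not MariaDB's implementation. The grants dictionary, which maps each grantee to the set of roles granted directly to it, and both function names are hypothetical and exist only for illustration.

```python
def reachable(grants, start):
    """Return every role that `start` holds, directly or through other roles."""
    seen = set()
    stack = [start]
    while stack:
        current = stack.pop()
        for role in grants.get(current, ()):
            if role not in seen:
                seen.add(role)
                stack.append(role)
    return seen


def would_create_cycle(grants, role, grantee):
    """True if GRANT role TO grantee would make some role contain itself."""
    # The new grant adds an edge grantee -> role. That closes a loop
    # exactly when role already leads back to grantee.
    return role == grantee or grantee in reachable(grants, role)


# State after "GRANT role_1 TO role_2;"
grants = {"role_2": {"role_1"}}

print(would_create_cycle(grants, "role_1", "role_1"))  # self-grant
print(would_create_cycle(grants, "role_2", "role_1"))  # closes a two-role loop
print(would_create_cycle(grants, "role_1", "role_3"))  # harmless
```

The self-grant is just the shortest possible cycle, which is why one test covers both it and the longer loops.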

GRANT CREATE / ALTER / DROP ROLE

“GRANT CREATE ROLE …” does not work. I regard that as a flaw.

WL#988 has quite a bit of verbiage about how there has to be a CREATE ROLE privilege, and why. MariaDB sidesteps — the decision was that all you need is a CREATE USER privilege. Although I cannot think of a way that this could lead to security breaches, I am fairly sure that cases exist where administrators who want to allow user-creation do not want to allow role-creation, and vice versa.

REVOKE

“REVOKE role_1 FROM user_1;” works. Good.

SET ROLE

“SET ROLE role_1;” works. And it only works if somebody previously said “GRANT role_1 TO user_1;” for the user that’s doing the SET ROLE. Good.

“SET role 'role_1';” works, that is, the role name can be a string literal. But “SET @x='role_1'; SET ROLE @x;” does not work. I regard that as a flaw because it’s a lack-of-orthogonality thing. By that I mean: if a string literal is okay in a certain place in an SQL statement, then The User On The Clapham Omnibus will expect that a string variable would be okay in the same place.

“SET ROLE role_1,role_2;” does not work. That’s because MariaDB, unlike Oracle, doesn’t allow assignment of multiple roles. And that’s what WL#988 asks for, so no complaints here.

SET ROLE DEFAULT does not work. I regard that as a flaw. Yes, I recognize there is no non-null default, but that just means that SET ROLE DEFAULT should have the same effect as SET ROLE NONE.

Big Example

The big example didn’t work due to incompatible syntax, but I see that there is some evidence that testing has been done. Good.

CURRENT_ROLE

“SELECT CURRENT_ROLE;” works. Good. As expected, it’s possible to find out what the last “SET ROLE” statement did.

“SELECT CURRENT_ROLE();” also works. I regard this as a flaw but I suppose it was inevitable. After all, “SELECT CURRENT_USER();” works.

CURRENT_ROLE is not on the current reserved words list and this will cause a minor incompatibility with MySQL.

ALTER ROLE

“ALTER ROLE role_1 …” does not work. Good.

ALTER ROLE exists in Oracle for changing the IDENTIFIED BY clause, but MariaDB doesn’t support the IDENTIFIED BY clause, so ALTER ROLE would have nothing to do.

RENAME ROLE

“RENAME ROLE role_1 TO role_2;” does not work. Good.

Although “RENAME USER user_1 TO user_2;” works, WL#988 reminds us that renaming a role is going to have cascading effects that wouldn’t exist for renaming a user.

SHOW PRIVILEGES

“SHOW PRIVILEGES;” is the same as it used to be. Good.

Since MariaDB doesn’t have a CREATE ROLE privilege, it doesn’t have to list it.

SET DEFAULT ROLE

“SET DEFAULT ROLE role_1 TO user_1;” does not work. I regard this as a flaw.

It’s a good thing if, when user_1 connects, user_1 automatically gets the privileges associated with a default role. It’s less good if user_1 has to do her own SET ROLE whenever she connects, probably with mysql --init-command='SET ROLE role_1'. The point of roles was supposed to be the elimination of administrator hassle, and --init-command won’t help there.

The Initial State

The initial state when a user connects is, effectively, SET ROLE NONE. That’ll have to do for now.

SHOW

“SHOW GRANTS FOR user_1;” works and shows roles too. Good.

“SHOW GRANTS FOR CURRENT_ROLE;” also works. WL#988 failed to suggest this obvious addition, so maybe I should call it: better than good.

INFORMATION_SCHEMA

“SELECT * FROM INFORMATION_SCHEMA.ENABLED_ROLES;” works. Good.

“SELECT * FROM INFORMATION_SCHEMA.APPLICABLE_ROLES;” also works. Also good.

And it does show contained roles. That is, “If role1 is contained in role2, then both roles appear.” However: It only shows the roles that have been granted for the current user, it does not show all roles. I regard that as a flaw.
[UPDATE: In an earlier version of this blog this observation was in the wrong place.] [UPDATE again: I missed what the cause was, the real problem was that I couldn’t grant to public — but that’s another bug. See comments.]

Command line options

“--role=role_name” doesn’t work (on the command line). Good.

The way to go for setting a role name at connect time is to support defaults. Any other “solution” would just get in the way.

For Replication

As WL#988 says, “The plan is to ignore replication till the last minute.” I can’t tell whether MariaDB is following such a plan.

Plan for Backup

Backup is simplified by the fact that users and roles come from the same table in the mysql database. I did not test it, so let’s say: assumed good.

Stored Procedures and Definer Privileges

After “SET ROLE role_1; CREATE PROCEDURE p () SQL SECURITY DEFINER SELECT CURRENT_ROLE; CALL p();” the result is NULL. Good.

The idea is supposed to be that, within an SQL SECURITY DEFINER routine, an implicit SET ROLE NONE takes place temporarily. It’s great that the implementer caught on to this rather obscure point.

Logging in with a Role Name

After “CREATE ROLE 'role_5@localhost';” or “CREATE ROLE role_5;” I was not able to log in as role_5. Good.

However, “CREATE DEFINER=role_5 PROCEDURE p2() SET @x=5;” works. I regard that as a flaw. There should at least be a warning that the DEFINER clause is being used with a nonexistent user. I get the impression from MDEV-4397 that this is deliberate behaviour, so don’t expect a fix.

DML statements and role privileges

The Oracle-style restriction for CREATE VIEW isn’t there. Good.

Other worklog tasks affected by roles

Well, WL#988 anticipated that roles would come first, and pluggable authentication support would come later. It didn’t happen that way. No problem.

Glossary

It doesn’t seem that the terminology differs from what’s expected. Good.

New columns in mysql.user

When I say “CREATE ROLE role_1;” the effect is that a new row goes into the mysql.user table. The mysql.user table definition had to be adjusted to take that into account. But I don’t think it’s all good.

WL#988 expected that the length of a role name would be 16 characters, but mysql.user.user is now CHAR(80). Well, that was a flaw, but just an old bug that wasn’t caused by introduction of roles.

The mysql.user.host column is blank (''). I’d expected it would be '%'. I suppose that this was intended as a way to disambiguate users and roles that had the same names, but that’s not working well anyway. The column should be '%' because that means “any or all hosts”.

There is one new column at the end, named is_role, which can be ‘Y’ or ‘N’. I regard that as a flaw. The future-considerations way would have been to add a new column named user_type, which could be NULL or ‘Role’ — or something else that somebody will dream up in a future release. The way it is now, if somebody dreams up another type later, they’ll have to add yet another column in mysql.user.

Storing role names

The implementer picked “Alternative #2: (loosely based on a suggestion from Monty Widenius)”. So there’s a new table in the mysql database:
CREATE TABLE mysql.roles_mapping (Host CHAR(60), User CHAR(80), Role CHAR(80), Admin_option ENUM('N','Y'))
… And it works. Good. Although ROLE was a reserved word in SQL-99, and although WL#988 suggested an additional column and different names, it works.
However, storing the WITH ADMIN OPTION value should only be in one place and that place is mysql.user, so mysql.roles_mapping should not contain a column named Admin_option. I regard this as a flaw. [UPDATE: no, it is not a flaw, see the comments.]

The Decisions about Options

The early decisions of Monty Widenius were: we don’t want passwords, we do want defaults, we don’t want multiple current roles, and PUBLIC should not be a role. That’s what ended up happening, except that there are no defaults. Good.

Overall

For all of the bugs, and for some of the flaws, there’s no worry — they’ll probably disappear. The MariaDB advantage here is the release-early policy, because there’s lots of advance time if there’s a decision to make a few corrections or course changes.

The MariaDB disadvantage is illustrated by its worklog task for roles, MDEV-4397. I’m biased, but I think it’s fair to say that MySQL’s WL#988 “high level architecture” specification covered the necessary ground and MariaDB’s MDEV-4397 did not. This means Oracle made the right decision when it “hid” its worklog task. Too bad it didn’t work out this particular time.

Mroonga and me and MariaDB

Chinese and Japanese and Korean (CJK) text usually has no spaces between words. Conventional full-text search does its tokenizing by looking for spaces. Therefore conventional full-text search will fail for CJK.

One workaround is bigrams. Suppose the text is

册免从冘

There should be three index keys, one for each two-character sequence:

册免, 免从, and 从冘.

Now, in a search like

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('免从');

a bigram-supporting full-text index will have a chance. It’s wasteful and there will be false hits whenever the bigram isn’t really a “word”, but the folks in CJK-land have found that bigrams (or the three-character counterpart, trigrams) actually work.
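The bigram idea is small enough to sketch in a few lines of Python. This is only an illustration of the technique, not how mroonga or any real engine stores its index. The function names and the two sample rows are made up.

```python
from collections import defaultdict


def bigrams(text):
    """Return every two-character sequence in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(rows):
    """Map each bigram to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for key in bigrams(text):
            index[key].add(row_id)
    return index


def search(index, term):
    """Return ids of rows that contain every bigram of term.

    For terms longer than two characters these are only candidates:
    the bigrams might occur in a row without being adjacent.
    """
    keys = bigrams(term)
    if not keys:
        return set()
    return set.intersection(*(index.get(key, set()) for key in keys))


rows = {1: "册免从冘", 2: "从冘册免"}  # invented sample data
index = build_index(rows)

print(bigrams("册免从冘"))
print(search(index, "免从"))
print(search(index, "册免"))
```

A lookup touches only the index entries for the search term's bigrams, so it never has to read rows that cannot match.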

One way to get bigrams for MySQL or MariaDB is to get mroonga.

Why care about Yet Another Storage Engine?

Back in 2008 a project named Senna attracted the attention of my colleagues at MySQL, but didn’t go on to world conquest.

Since around 2011 the groonga project, billed as a “successor” to Senna, has been putting out regular releases of a generic library that can connect to more than one DBMS. “Mroonga” is the “M[ySQL storage engine that interfaces with the] groonga [code]”. So, although it hasn’t been packaged with MySQL or MariaDB until now, it’s not all new or untested code. What’s new is that MariaDB will, real soon now, include mroonga as part of the regular MariaDB download. Poof, credibility at last.

I understand that mroonga has features that make it interesting:
* It’s a column-storage engine, which I suppose makes it an alternative to Infobright or InfiniDB
* Some of its users or developers also have an involvement with the Spider storage engine, which I suppose means there wouldn’t be too much trouble using the two engines in concert: mroonga for full-text and Spider for sharding.
… but I didn’t look at the column storage or the Spider compatibility. I cared only that SHOW ENGINES said it’s “CJK-ready fulltext search”.

Of course, SHOW ENGINES can also show MyISAM and InnoDB and SphinxSE, and they can all do full-text searching too. One might choose MyISAM because for a long time that was the only engine that had full-text, or choose InnoDB because it’s the default nowadays, or choose SphinxSE because it has lots of features including stemming. But those aren’t targeted for the CJK niche. For example, MariaDB has no bigrams. Sphinx’s documentation says it does have bigrams, but a glance shows instantly “those are not the bigrams you’re looking for”.

The only questions, then, are (a) does mroonga really handle full-text, and (b) does mroonga really handle CJK?

Does mroonga work?

At the time I’m writing this, MariaDB 10.0.8 doesn’t have mroonga yet. There are instructions for getting packages with Windows and Ubuntu etc., but such things never work for me — they’ll always be out of synch with something else that I’ve got, or with what the main vendor (Oracle or MariaDB in this case) is updating. So I decided to build from source. Getting mroonga was easy since I already have Bazaar:

bzr branch lp:~mroonga/maria/mroonga

and then build. Actually the initial build failed because mroonga required CMake 2.8.8 which doesn’t come with Ubuntu 12.04. But after I worked around that, I had a server and all I had to say was

INSTALL PLUGIN mroonga SONAME 'ha_mroonga.so';
CREATE FUNCTION last_insert_grn_id RETURNS INTEGER SONAME 'ha_mroonga.so';
SET GLOBAL mroonga_log_level=NONE;

I found more out by looking at some documentation, which is nowhere near Oracle quality but is mostly in readable English.

Then I was able to do this:

CREATE TABLE t (
  id INT NOT NULL DEFAULT '0',
  `text_column` VARCHAR(100) CHARACTER SET utf8 DEFAULT NULL,
  PRIMARY KEY (id),
  FULLTEXT KEY `text_column` (text_column)
) ENGINE=mroonga DEFAULT CHARSET=latin1 COMMENT='engine "innodb"';

It’s probably less than obvious that I’m looking at a good-looking feature. The point is: with this syntax I can use mroonga with an InnoDB table. Whoopie. Although that means I lose mroonga’s other features, I can use it without losing whatever I find good about InnoDB. And then I was able to do this:

SELECT * FROM t WHERE MATCH(text_column) AGAINST ('x');

Again it’s probably less than obvious that I’m looking at a good-looking feature. The MATCH … AGAINST syntax is idiosyncratic and unobvious — don’t expect this to arrive in an SQL standard near you any time soon. But it’s what the long-term MyISAM user is used to, so the transition isn’t painful.

Does mroonga handle CJK?

The J (Japanese) part is certainly there. The developers are Japanese. One of the optional extras is integration with MeCab which is a tool that can handle Japanese morphology — it’s like solving the “no spaces” problem by understanding a bit about the Japanese language, which after all is the solution that a human would use with no-spaces text. I don’t see, though, that there’s been equivalent attention paid for the C (Chinese) and K (Korean) parts of CJK. So I just looked at the bigrams, since they’re all that C or K could benefit from.

I created a million-row table containing randomly-chosen characters, mostly Kanji, but with a mix of Latin and Kana letters. How I made it is tedious, but perhaps somebody out there will want to know how to make randomly populated tables with such “data”, so I’ll put the code in a comment on this posting.

I had a choice between a bewildering variety of ways to tokenize. The default one, TokenBigram, did the job. The good news for me was that it didn’t do bigrams when the text was entirely Latin — that would be silly — but in a separate test I did see that it indexed Latin characters as ordinary words. So one of the worries that I had — that mroonga would be only good for CJK but not for non-CJK mixed inside the same text column — turned out not to be a worry.
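The mixed behaviour described above (whole words for Latin text, bigrams for everything else) can be imitated roughly in Python. This is a sketch of the general idea only. It is not groonga's TokenBigram, whose real rules I have not reproduced, and mixed_tokens is a hypothetical name. One simplification to be aware of: punctuation falls into the non-Latin runs.

```python
import re

# A run of ASCII letters/digits, or a run of any other non-space characters.
RUN_PATTERN = re.compile(r"([A-Za-z0-9]+)|([^\sA-Za-z0-9]+)")


def mixed_tokens(text):
    """Tokenize Latin runs as whole lowercased words and other runs as bigrams."""
    tokens = []
    for match in RUN_PATTERN.finditer(text):
        latin_run, other_run = match.group(1), match.group(2)
        if latin_run:
            tokens.append(latin_run.lower())
        elif len(other_run) == 1:
            # A lone character has no bigram, so keep it as it is.
            tokens.append(other_run)
        else:
            tokens.extend(
                other_run[i:i + 2] for i in range(len(other_run) - 1)
            )
    return tokens


print(mixed_tokens("MySQL 册免从冘 index"))
```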

And now, the essential point, the “l’essence du bigram” (which would look good on a restaurant menu, eh?) …

MariaDB [test]> SELECT COUNT(*) FROM t WHERE MATCH(text_column) AGAINST ('免从');
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (0.00 sec)

It found a row with a bigram! But is that the right count?

MariaDB [test]> SELECT COUNT(*) FROM t WHERE text_column LIKE '%免从%';
+----------+
| COUNT(*) |
+----------+
|        1 |
+----------+
1 row in set (2.48 sec)

Yes, it’s the right count — a LIKE search confirms it.

The difference is that the LIKE search took 2.5 seconds because it had to scan a million rows. The mroonga full-text search took approximately zero seconds because it used an index. That doesn’t mean that mroonga is fast — not every search is a benchmark. I did find a graph that shows mroonga throughput is better than MyISAM’s or SphinxSE’s but probably if I looked hard I could find a graph that says the opposite.

And the crowd goes wild

Although I have done no benchmarks or bug hunts, I have acquired the impression that mroonga is capable of doing what its makers say it ought to be able to do.