Month: July 2016

What’s in the SQL of NoSQL

In a previous post I said it’s bogus that NoSQL stands for Not Only SQL, but NoSQL products can have “some” SQL. How much?

To get past the SQL-for-Hadoop stuff I’ll just mine a few quotes: “Hive was the first SQL on Hadoop engine and is still the most mature engine.” “Apache Phoenix is a project which aims to provide OLTP style SQL on top of Apache HBase.” “Cloudera Impala and Apache Drill are the two most prominent Dremel clones.” “Oracle, IBM, and Greenplum have all retrofit their database engines to integrate with Hadoop in various ways.” There, that’s the history out of the way. With thanks to a comprehensive article series: The Truth about SQL on Hadoop”.

Of course I could add that it’s possible to take an SQL front and add a NoSQL back as an “engine”, as MariaDB did with Cassandra plus LevelDB, as PostgreSQL did with MongoDB.

Here’s what I see as the documented capabilities provided by NoSQL vendors themselves, or via the Dremel clones.

thing CQL Drill Impala OrientDB
SELECT + WHERE 1/2 yes yes yes
SELECT + GROUP BY no yes yes yes
SELECT + functions 1/2 yes 1/2 yes
SELECT + IS NULL no yes yes yes
Subqueries or Joins no 1/2 yes no
GRANT + REVOKE yes no yes yes
Stored SQL Routines no no no no
Collations no no 1/10

CQL (Cassandra Query Language) does what I expect: I can do the regular DML statements, but UPDATE can only handle one row at a time. There are no subqueries or joins or views or SQL functions, and the WHERE is restricted because I have to use a query that includes information that tells Cassandra what cluster to go to. All strings are ASCII or UTF-8.

Apache Drill has all the options I respect for SELECT, even windowing. But ithe updating is done outside SQL. There are lots of built-in functions that CQL lacks.

Impala, from Cloudera, is regarded as a good option for analytic queries.

OrientDB handles graph databases, which the OrientDB founder says is defined as “index-free adjacency” (that’s my first joke, I’m making sure you don’t miss them).

For my chart, I deliberately added criteria that I thought might make the DBMSs choke. Specifically I thought IS [NOT] NULL would cause trouble because in a flexible schema the data might be either “missing” = not stored at all or “unknown” = stored explicitly as null value … but usually this caused no problem. Specifically I thought collations would cause trouble because they must affect either storage or performance … and they certainly did, CQL and Impala are byte-value-comparison-only and OrientDB has a grand total of two collations (case sensitive or insensitive). Most of the limitations are understandable for a DBMS that does clustering and searching quickly, but that’s not mom-and-pop stuff. I will end this paragraph with my second joke. NoSQL doesn’t scale … down.

Sometimes the vendors get it that when you use SQL you should use ANSI SQL, for example it’s typical that identifiers can be enclosed within double quotes — far better than MySQL/MariaDB’s ANSI_QUOTES requirement. On the other hand I see that Impala has SHOW statements — just as bad as MySQL/MariaD. And I’ve no idea where OrientDB came up with the syntax “ORDER BY … SKIP”.

On the other hand, NoSQL can be expected to be better at flexible schemas. And it is, because some of the SQL vendors (I’m especially thinking of the work-in-progress “dynamic column” stuff in MariaDB and the JSON functions in MySQL) are still catching up. But they will catch up eventually. That’s why I like the Gartner Report on DBMSs predicting

By 2017, the “NoSQL” label will cease to distinguish DBMSs, which will result in it falling out of use.

Only 5 months left to go, eh?

Another NoSQL DBMS, Tarantool, is also going to have SQL real soon now. Anti-disclosure: I do some work for the vendor but I had nothing to do with the SQL planning.

The idea behind Tarantool/SQL is: instead of struggling to add a piece at a time to a home-grown parser, just fork the one in SQLite. Poof, there’s SQL syntax which, as I discussed in an earlier post, is more than an insignificant subset of the standard core features. The actual data maintenance job is done by Tarantool’s multi-user server.

Here is an ocelotgui screenshot of a Tarantool/SQL query showing the usual advantages that one gets when the language is SQL: syntax highlighting, predicting, and columnar output.