Month: January 2016
Bewilderedly behold this SQL statement:
SELECT * FROM a.b.c;
Nobody can know the meaning of “a.b.c” without knowing the terminology, the context, and the varying intents of various DBMS implementors.
It’s pretty clear that a.b.c is a name, that is, a unique reference to an SQL object, which in this case is a table.
The name happens to be qualified — it has three parts separated by periods. A generalist would say that each part is a container, with the first part referring to the outermost container a, which contains b, which contains c. Analogies would be: a directory with a subdirectory and a sub-subdirectory, or an element within a struct within a struct.
That is true but incomplete. Now we fight about what kind of container each
identifier refers to, and what is the maximum number of levels.
What kind of container
The standard suggests that the levels are
[catalog.] [schema.] objectAnd here are the possible ways to do it:
1. Ignore catalog and schema, the only legal statement is “SELECT * FROM c;”, so c is a table name and there is no qualification. This is what happens in Firebird.
2. Ignore catalog, the legal statements are “SELECT * FROM b.c;” or “SELECT * FROM c;”, so b is a schema name, or the schema name is some default value (in MySQL/MariaDB this default value is specified by the USE statement, in standard SQL by the SET SCHEMA statement). In this case the schema identifier is the same as the database identifier. This is what happens in MySQL/MariaDB, and in Oracle. I am ignoring the fact that MySQL/MariaDB has a single catalog named def, since that has no practical use.
3. Ignore nothing, the legal statements are “SELECT * FROM a.b.c;” or “SELECT * FROM b.c;” or “SELECT * FROM c;”. The outermost (catalog) container is for a server. This is what happens in DB2.
4. There is a fourth level. This is what happens in SQL Server.
Naturally the variety causes trouble for JDBC or ODBC applications, which have to connect to any sort of DBMS despite the contradictory meanings. But when you’re connecting to the same DBMS every time, it can be okay.
The standard meaning of catalog
In SQL-99 Complete, Really — which is reproduced on mariadb.org — the idea of a catalog is expressed as a named container in an “SQL-environment”. Okay, but that leaves some uncertainty. We know that an SQL-environment can mean “the server and the client” or “the server as seen by the client”. And beyond that, we see the words “implementation-defined” because different vendors have different ideas.
The standard is neutral. Optional Feature F651 “Catalog name qualifiers” says simply that there can be a catalog identifier, there can be a “SET CATALOG ‘identifier’;” statement which has “no effect other than to set up default information”, there is no defined way to create a catalog (that is, there is no official analogue for CREATE SCHEMA), and the rest is up to the implementor.
The PostgreSQL idea is that the catalog is the cluster. The word “cluster” appears in database literature but it’s vague, it doesn’t have to mean “a group of servers connected to each other and known to each other which co-operate”. But that’s how PostgreSQL interprets, so “SELECT * FROM a.b.c;” implies “in cluster a, in which there is a schema b which should be uniquely defined within the cluster, there is a table c”.
The DB2 idea is that the catalog is the server, or an instance of the server. Presumably, if there’s only one server in a cluster, this would be the same as the PostgreSQL idea. So “SELECT * FROM a.b.c;” implies “in server a, in which there is a schema b, there is a table c”. This might be what C.J.Date was talking about when he wrote, in A Guide To The SQL Standard: “Or different catalogs might correspond to databases at different sites, if the SQL-environment involved some kind of distributed processing.” This also might be what the Hitachi manual is implying when it says a table can be qualified as node_name . authorization_identifier . table_name.
The Microsoft idea is pretty well the same thing, except that SQL Server considers “server” to be at the outermost (fourth) level.
Oracle, also, can specify a server in a table name; however, Oracle does it by appending rather than prefixing, for example “schema.table@database_link”.
In all cases the levels are separated by dots or at-signs, so it’s clear what the parts are. Astonishingly, there is a NoSQL DBMS that uses dot separators
but allows unquoted names to include dots(!); mercifully I’ve only seen this once.
Speaking of NoSQL, Apache Drill’s levels for a pseudo-table-name are plugin . workspace . location. That is an excellent way to specify what plugin to use, but I don’t see how it would work if there are two instances of a plugin.
Default schema name
The default schema name is typically implementation-defined when “the schema” is synonymous with “the database”. For example, MySQL and MariaDB say it’s null, but it’s common to start off by saying –database=test or “USE test”.
The default schema name can be the user name when the qualification is “table-qualifier . table-owner . table-name”. For example, if you log in as joe, then “SELECT * FROM t;” will be interpreted as “SELECT * FROM joe.t;”. This is becoming uncommon, but you’ll still see hints about it in old guides to ODBC, and in the Hitachi manual that I mentioned earlier.
The folks at Microsoft have an interesting BNF:
server . [catalog] . [schema] . table
| catalog . [schema] . table
| schema . table
… which is not the same as the standard’s:
[catalog .] [schema .] table>
… See the difference? It means that one can leave catalog-name and schema-name blank if they’re default. So this is legal:
SELECT * FROM RemoteServer…t;
Alas, it’s easy to misread and think “[schema].table” is legal too. The MySQL manual is apparently alluding to that, but without trying to explain it, when it says: “The syntax .tbl_name means the table tbl_name in the default database. This syntax is accepted for ODBC compatibility because some ODBC programs prefix table names with a “.” character.”
Below the bottom level
What happens if an object has sub-objects? I don’t mean columns — it’s obvious that “SELECT x.y FROM t AS x;” assumes there’s a column y in table t — but I do mean partitions. As far as I know, nobody has partition identifiers within table names, although sometimes it’s okay to put them as hints after table names.
I also should mean sub-parts of columns, as one could find when (for example) the column is XML or JSON. But that’s a big enough subject for separate blog posts.
What is the use?
I was thinking of qualifiers when considering an enhancement to ocelotgui, our GUI client for MySQL and MariaDB. Suppose that one could have more than one DBMS connection. This isn’t possible with the regular mysql client, but we can go beyond it. So suppose we connect to two different DBMSs, or suppose we connect to the same DBMS twice. The first hurdle is that the usual parameters (–host, –port, –user, –password) are scalar, so we have to supplement them with new options, such as –connection2_host, –connection2_port, –connection2_user, –connection2_password. That’s easy, but now, if the user enters
SELECT * FROM b.c;
how do we know which connection the user wants to use, the one specified by –host or the one specified by –server2_host?
This could be done by saying “following statement is to be passed to connection X”, in a separate command or a menu choice. However, I rather like the idea of making catalog = server, so that “select * FROM b.c;” means send to the default server, “select * from server2.b.c;” means send to second server, and “USE server2.b” would change the default catalog as well as the default schema. This is easy and is all done by the client, so there’s no question it would work, the only thing to resolve is whether it’s a good idea.
Whatever happens, it will be in the next version of ocelotgui, along with a few bug fixes and other features. That next version will be 0.9, probably our first beta, in February 2016. But I do hope that people won’t wait — version 0.8 alpha is safe and good-looking and easy to download from github.