Uncategorized – SQL And Its Sequels

Category: Uncategorized

Column Order in SQL Tables

Read The language rules we know – but don’t know we know which says “Adjectives in English absolutely have to be in this order: opinion-size-age-shape-colour-origin-material-purpose Noun.” Then you’ll know this is correct:

CREATE TABLE little_girls (
  grade FLOAT, height SMALLINT, birth DATE, fatness BLOB,
  hair SET('blond','black'), previous_school VARCHAR(64),
  is_made_of_sugar_and_spice_and_everything_nice BOOL,
  reason_for_existence TEXT DEFAULT 'thank heaven');

because columns are attributes and so are adjectives.

Now I’ll tell you the other factors that affect whether chicken columns precede egg columns.

Primary key columns first

As the Oracle “Ask Tom” guy said unenthusiastically in 2012: “For example – most people – pretty much all people – would put the primary key FIRST in the table. It is a universally accepted approach – convention.” Sometimes this is part of a “rule of thumb”. Sometimes it’s part of “logical data modeling standards”. Sometimes it’s part of “business rules”.

I’ve even seen an “automated analysis of database design” tool that marks you down if you put the primary key somewhere else.

General before specific

The United States Postal Service says “Automated mail processing machines read addresses on mailpieces from the bottom up and will first look for a city, state, and ZIP Code.” because those folks know that what matters first is the larger area and what matters last is the street spot.

That’s also what people care about so often they’ll say
genus before species,
or schema name before object name,
or test type before test result.

Specificity is not the same as uniqueness — there might be more addresses “100 Main Street” then there are states in the USA. But it is about moving from the target to the bullseye. Unique-first might be an advantage for CREATE INDEX, but not for CREATE TABLE. And that might mean you might want to say CREATE TABLE t (column1, column2 PRIMARY KEY (column2, column1)) but it does not mean you want to reverse the order of display.

Surname first

Looking at the States stats again, there are about 5 million occurrences of the most common given name (James) and about 2.5 million occurrences of the most common last name (Smith). That’s anecdotal but corresponds to the general belief that last names are uniquer than first names. What matters more is that the name James is less useful for identification, and therefore less important.

Alphabetical

If you’ve ever heard “Choose one from column A and one from Column B” you’ll feel this isn’t quite absurd, and if you’ve got 26 columns named column_a through column_z then you’ll appreciate it if there’s an arbitrary-but-familiar way to glance through them.

And in fact it’s not arbitrary if your column names have prefixes that indicate what group they belong to. Surely user_name and user_address belong together, or if it’s the address suffix that matters more then alphabetical DESC would work though I’ve never seen it proposed.

Important before trivial

There’s advice that columns should appear in order of importance, and if users don’t know what’s important then there are clues.

Is the value usually the same or usually NULL? Then it carries little information.

Is the column name absent from most SELECT statements? Then most people don’t care.

Is it indexed, or unique, or in a foreign key? Then somebody has already decided it matters. Accept that and bump them closer to the start.

What matters for SELECT * is:
the columns that you like most should be on the left of the screen (even if you’re Arab) so you don’t need to use horizontal scrolling.

What matters for SELECT column_x is:
if the row is so big that it might stretch over two blocks, the DBMS might be happier if the column is in the first block.

The ancient wisdom

There’s a rule that I’ve seen often, with occasional variations:
“Primary key columns first.
Foreign key columns next.
Frequently searched columns next.
Frequently updated columns later.
Nullable columns last.
Least used nullable columns after more frequently used nullable columns.
Blobs in own table with few other columns.”

What I’ve seen much less often is a link to what appears to be the first occurrence: https://community.hpe.com/t5/operating-system-hp-ux/oracle-columns/td-p/2840825. It’s from 2002 and it’s in response to a question about Oracle.

So, for example, if your DBMS happens to be one that doesn’t store NULLs if they’re the last things in the row, then that’s good advice. If on the other hand your DBMS happens to be one that has a preface of bits indicating whether the column is NULL, then it’s pointless. Documentation about this might for one InnoDB case be slightly misleading; however, my point is that most of those items will depend on your DBMS and the clauses that you use for a storage engine.

Nevertheless I’ll endorse the “NULLs last” tip because I’ve seen an old claim that it affects SQL Server too. That is, if it’s been a rule for more than one DBMS, even if it’s pointless, maybe it affects compatibility because others still think it’s real.

Old SQL Server rules, for what they’re worth, also include “fixed fields first” and “short before long” but they’re obsolete. Also if you’ve got InnoDB’s compact format then lengths of variable columns are in the header. So when you see claims that you have to journey through the varchar columns till you reach the char columns, be skeptical.

Autocompletion

Suppose in the past you said

CREATE TABLE n (s1 INT, s2 INT);

and now (after rehash) you start an INSERT statement:

INSERT INTO n (

at this point the ocelotgui autocompleter will present you with a hint and you can save a fraction of a second by selecting the first item so you don’t have to navigate.

Unfortunately you won’t save anything when the statement is SELECT, though, because the SELECT list comes before the FROM clause. That is, the GUI won’t know what table you’re using until you’ve already specified the columns.

Speaking of irritating design, I take this opportunity to mention the syntax:

INSTALL PLUGIN plugin_name SONAME 'plugin_library';

Notice again how the GUI can’t hint about a name at the appropriate point, because it won’t know the library yet. I class this as a violation by the vendor of the advice “general before specific”.

Compression

If there is a “run” of bytes with the same value, then a “run-length encoding” (RLE) algorithm might save space. Thus if if it is likely that 20 columns will all contain 0, then keeping them together is a good idea.

But it’s only an idea. The practical considerations are that (a) there’s no compression (b) the typical compression is some variant of Lempel-Ziv which wouldn’t require the columns to be exactly contiguous, (c) it’s hard to predict what algorithms and byte placements will actually save the most space without trying them all (d) they’re not all available unless you build from source.

Consistency

Consistency between tables: if table #1 has columns in a particular order, then table #2 should too. Imitate the INFORMATION_SCHEMA tables because they are likely to be static, although the collations of string columns are likely to change. I’d add “imitate examples in the manual”, but only if the examples in the manual are themselves consistent. For example the MySQL manual example

CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20),
       species VARCHAR(20), sex CHAR(1), birth DATE, death DATE);

has no explicit primary key, but the columns might be in order of importance and the final column is the one most likely to contain NULL. For example the MariaDB manual example

CREATE TABLE test.accounts (
   id INT PRIMARY KEY AUTO_INCREMENT,
   first_name VARCHAR(25),
   last_name VARCHAR(25),
   email VARCHAR(100),
   amount DECIMAL(15,2) CHECK (amount >= 0.0),
   UNIQUE (email)
) ENGINE=InnoDB;

has the PRIMARY KEY clause mixed in the column definition but the UNIQUE clause outside the column definition, and has first name before surname.

Consistency between statements: if WHERE clauses often have WHERE a = 5 AND b = 6, or if ORDER BY clauses often have A, B ASC, or if INSERT statements often have (a, b), then the CREATE TABLE definition can serve as a guide by letting users know that a comes before b by design.

Consistency between definitions: if table #1 has one foreign key referencing table #X and then another foreign key referencing table #Y, and table #2 has one foreign key referencing table #y and then another foreign key referencing table #X, then ask yourself why the order of foreign keys in table #2 is not the same as the order of foreign keys in table #1. Ideally the order will be reflected in entity-relationship diagrams.

ocelotgui changes

The ocelotgui 2.4 release is available for download from github.

The debugger feature won’t work with MariaDB 11.5ff for reasons I’ll explain elsewhere. However, if you download the source and build, it will work. Or wait for ocelotgui 2.5 which (I hope) will be released soon.

pgulutzan, August 7, 2024. Category: MySQL / MariaDB, Standard SQL, Uncategorized.

Short note re Progress and NuSphere

Progress Software on 2024-04-19 said more about “considering” an offer for MariaDB plc (the company not the foundation).

They own NuSphere which had a dispute with MySQL AB which was settled in 2002. My happy history as a MySQL employee biases me but I thought that NuSphere was not acting angelically.

I think it won’t happen.

Update: Apparently it didn’t. The 2024-04-26 K1 recommended offer differed.

pgulutzan, April 23, 2024. Category: MariaDB, Uncategorized.

Make SHOW as good as SELECT even with InnoDB

Here’s a snippet of what I’d like SHOW ENGINE INNODB STATUS to look like:

as opposed to what the server sends:

...
--------
FILE I/O
--------
Pending flushes (fsync): 0
295 OS file reads, 1 OS file writes, 1 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
...

In my last post I described a way to
Make SHOW as good as SELECT”
which was possible because most result sets from SHOW etc. are at least table-like, but
STATUS’s wall of text isn’t table-like. So I forced it into a table with these basic rules:

A line inside ——s, like FILE I/O, is the category of what follows.
Otherwise a line is a row but if it contains commas it is multiple rows.
Numbers can usually be extracted from text as different columns..

After that there’s still a bunch of fiddling, I put the details in source-code comments.

Version 2.3

The new features related to SHOW etc. are now in a released version as well as in source code,
downloadable from github.

In the rest of this post I’ll show a complete result from SHOW ENGINE INNODB STATUS; (“before”),
and the same data from SHOW ENGINE INNODB STATUS WHERE 1 > 0; after ocelot_statement_syntax_checker
has been set to ‘7’ (“after”). (Instead of copying the Grid Widget I copied from the History Widget
after setting Max Row Count to 100.)

“before”

| InnoDB |      | 
=====================================
2024-03-19 12:39:45 0x7f80f01f3700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 43 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 0 srv_active, 2455 srv_idle
srv_master_thread log flush and writes: 2454
----------
SEMAPHORES
----------
------------
TRANSACTIONS
------------
Trx id counter 754
Purge done for trx's n:o < 0 undo n:o < 0 state: running but idle
History list length 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION (0x7f8111334680), not started
0 lock struct(s), heap size 1128, 0 row lock(s)
---TRANSACTION (0x7f8111333b80), not started
0 lock struct(s), heap size 1128, 0 row lock(s)
--------
FILE I/O
--------
Pending flushes (fsync): 0
295 OS file reads, 1 OS file writes, 1 OS fsyncs
0.00 reads/s, 0 avg bytes/read, 0.00 writes/s, 0.00 fsyncs/s
---
LOG
---
Log sequence number 463193
Log flushed up to   463193
Pages flushed up to 362808
Last checkpoint at  362808
----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 167772160
Dictionary memory allocated 853016
Buffer pool size   8064
Free buffers       7647
Database pages     417
Old database pages 0
Modified db pages  164
Percent of dirty pages(LRU & free pages): 2.033
Max dirty pages percent: 90.000
Pending reads 0
Pending writes: LRU 0, flush list 0
Pages made young 0, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 273, created 144, written 0
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 417, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]
--------------
ROW OPERATIONS
--------------
0 read views open inside InnoDB
state: sleeping
----------------------------
END OF INNODB MONITOR OUTPUT
============================
 |

“after”

pgulutzan, March 19, 2024. Category: MySQL / MariaDB, Uncategorized.

Make SHOW as good as SELECT

For example, this works:

SHOW AUTHORS GROUP BY `Location` INTO OUTFILE 'tmp.txt';

You’re thinking “Hold it, MySQL and MariaDB won’t allow SHOW (and similar statements like ANALYZE or CHECK or CHECKSUM or DESCRIBE or EXPLAIN or HELP) to work with the same clauses as SELECT, or in the same places.” You’re right — but they work anyway. “Eppur si muove”, as Galileo maybe didn’t say.

I’ll explain that the Ocelot GUI client transforms the queries so that this is transparent, that is, the user types such things where SELECTs would work, and gets result sets the same way that SELECT would do them.

Flaws and workarounds

I’ll call these statements “semiselects” because they do what a SELECT does — they produce result sets — but they can’t be used where SELECT can be used — no subqueries, no GROUP BY or ORDER BY or INTO clauses, no way to way to choose particular columns and use them in expressions.

There are three workarounds …

You can select from a system table, such as sys or information_schema or performance_schema if available and if you have the privileges and if their information corresponds to what the semiselect produces.

For the semiselects that allow WHERE clauses, you can use the bizarre “:=” assignment operator, such as

SHOW COLUMNS IN table_name WHERE (@field:=`Field`) > '';

and now @field will have one of the field values.

You can get the result set into a log file or copy-paste it, then write or acquire a program that parses, for example by extracting what’s between |s in a typical ASCII-decorated display.

Those three workarounds can be good solutions, I’m not going to quibble about their merits. I’m just going to present a method that’s not a workaround at all. You just put the semiselect where you’d ordinarily put a SELECT. It involves no extra privileges or globals or file IO.

Example statements

CHECK TABLE c1, m WHERE `Msg_text` <> 'OK';

SELECT * FROM (DESCRIBE information_schema.tables) AS x ORDER BY 1;

SHOW COLLATION ORDER BY `Id` INTO OUTFILE 'tmp.txt';

SELECT `Type` FROM (SHOW COLUMNS IN Employees) AS x GROUP BY `Type`;

SELECT UPPER(`Name`) from (SHOW Contributors) as x;

SHOW ENGINES ORDER BY `Engine`;

(SELECT `Name` FROM (SHOW CONTRIBUTORS) AS x
UNION ALL SELECT `Name` FROM (SHOW AUTHORS) AS y)
ORDER BY 1;

CREATE TABLE engines AS SHOW ENGINES;

How does this work?

The client has to see where the semiselects are within the statement. That is easy, any client that can parse SQL can do it.

The client passes each semiselect to the server, and gets back a result, which ordinarily contains field names and values.

The client changes the field names and values to SELECTs, e.g. for SHOW CONTRIBUTORS the first row is

(SELECT 'Alibaba Cloud' AS `Name`, 'https://www.alibabacloud.com' AS `Location`, 'Platinum Sponsor of the MariaDB Foundation' AS `Comment")

and that gets UNION ALLed with the second row, and so on.

The client passes this SELECT to the server, and gets back a result as a select result set.

Or, in summary, what the client must do is: Pass the SHOW to the server, intercept the result, convert to a tabular form, send or SELECT … UNION ALL SELECT …; to the server, display.

However, these steps are all hidden. the user doesn’t have to care how it works.

Limitations

It requires two trips to the server instead of one. The client log will only show the semiselect, but the server sees the SELECT UNION too.

It will not work inside routines. You will have to CREATE TEMPORARY TABLE AS semiselect; before invoking a routine, in order to use the semiselect’s result set inside CREATE FUNCTION | PROCEDURE | TRIGGER.

Speaking of CREATE TEMPORARY TABLE AS semiselect, if there are VARCHAR columns, they will only be as big as the largest item in the result set.

It will not work inside CREATE VIEW.

Sometimes it will not work with nesting, that is semiselects within semiselects might not be allowed.

Some rare situations will expose the SELECT result in very long column names.

Try it today if you can build from source

On Linux this is easy — download libraries that ocelotgui needs, download ocelotgui, cmake, make. (On Windows it’s not as easy, sorry.) The source, and the README instructions for building, are on github.

After you’ve started up ocelotgui and connected to a MySQL or MariaDB server, there is one preparatory step: you have to enable the feature. (It’s not default because these aren’t standard SQL statements.) You can do this by going to the Settings|Statement menu and changing the Syntax Checker value to 7 and clicking OK. Or you can enter the statement

SET OCELOT_STATEMENT_SYNTAX_CHECKER = '7';

Now the feature is enabled and you can try all the examples I’ve given. You’ll see that they all work.

Of course it’s made available this way because the status is beta.

Try it a bit later if you can’t build from source

This will be available in executable form in the next release of ocelotgui, real soon now. If you have a github account, you can go to the github page and click Watch to keep track of updates.

Update: the release happened on 2024-03-18, source and executables are at https://github.com/ocelot-inc/ocelotgui.

pgulutzan, February 29, 2024. Category: MySQL / MariaDB, Uncategorized.

Packages in MariaDB default mode

MariaDB 11.4 has a new feature: CREATE PACKAGE with routine syntax for the default mode as opposed to sql_mode=’Oracle’. It’s a well-written and long-desired feature but, since it’s alpha, a few things might still need change. I’ll say how it works, with details that aren’t in the manual and probably never will be.

The point

A package is a group of routines (procedures or functions) for which I can CREATE and GRANT and DROP as a unit, all at once.

Roland Bouman wrote a feature request for it in 2005 for MySQL, but MySQL hasn’t got it yet, the workaround is to create whole databases. MariaDB has had CREATE PACKAGE since version 10.3 but only when sql_mode=’oracle’, and only with Oracle syntax (“PL/SQL”) for defining the routines.

Now MariaDB has CREATE PACKAGE with the default sql_mode, i.e. anything except sql_mode=’oracle’, and with ordinary standard-like syntax (“SQL/PSM”) for defining the routines. But it’s a bit of a hybrid because, although the routine definitions within the package are SQL/PSM, the CREATE PACKAGE statements themselves are not.

Package versus Module

CREATE PACKAGE is a PL/SQL statement. CREATE MODULE is the SQL/PSM statement for something functionally very similar.

Here I compare the way MariaDB creates packages versus the way the standard prescribes for modules. I ignore trivial clauses that appear in most CREATE statements.

The MariaDB way

+------------------------------------------------------------+
| CREATE PACKAGE package_name                                |
| [ COMMENT or SQL SECURITY clause ... ]                     |
| [ FUNCTION | PROCEDURE name + COMMENT or SQL clauses ... ] |
| END                                                        |
+------------------------------------------------------------+

  +-------------------------------+ 
  | CREATE PACKAGE BODY           |
  | [ variable declaration ... ]  |
  | | routine definition ... ]    |
  | END                           |
  +-------------------------------+

The standard way

+-------------------------------------+
| CREATE MODULE module_name           |
| [ NAMES ARE character_set_name ]    |
[ [ SCHEMA default_schema_name ]      |
[ [ path specification ]              |
| [ temporary table declaration ... ] |
|  [DECLARE] routine-definition; ...  ]
|  END MODULE                         |
+-------------------------------------+

The most prominent vendor with CREATE PACKAGE is of course Oracle, but others, for example PostgreSQL and IBM, have it too.

The most prominent vendor with CREATE MODULE is IBM but Mimer has it too.

The basic example

So the absolute smallest example of statements that have all the relevant features is:

CREATE PACKAGE pkg1
 PROCEDURE p1();
 FUNCTION f1() RETURNS INT;
END;
CREATE PACKAGE BODY pkg1
  DECLARE var1 INT;
  FUNCTION f1() RETURNS INT RETURN var1;
  PROCEDURE p1() SELECT f1();
  SET var1=1;
END;
SELECT pkg1.f1();
CALL pkg1.p1();
SHOW CREATE PACKAGE pkg1;
SHOW CREATE PACKAGE BODY pkg1;
GRANT EXECUTE ON PACKAGE db.pkg TO PUBLIC;
DROP PACKAGE pkg1;

Documentation and Terminology

In the Canadian Football League there used to be an official term “non-import” for a player who, essentially, wasn’t from the States or Europe or Samoa etc. This caused some complaint because there were simpler terms, like, um, “Canadian” or “national” i.e. native.

Eventually the League realized that adding “non-” was being negative about the default player situation.

I was reminded of that when reading the MariaDB manual, which now has split up the sections for CREATE PACKAGE and CREATE PACKAGE BODY to put “Oracle mode” and “non-Oracle mode”. I am hopeful that someday MariaDB, like the Canadian Football League, will come up with a less negative term such as “default”, or “when sql_mode is the default”. Also I am hopeful — here I speak as the former head of documentation for MySQL — that there will be rearrangement so that the default is shown first, as it will be more important than sql_mode=’oracle’, won’t it?

Another change will happen soon — perhaps by the time you read this — to the BNF. Currently it is

CREATE
    [ OR REPLACE]
    [DEFINER = { user | CURRENT_USER | role | CURRENT_ROLE }]
    PACKAGE [ IF NOT EXISTS ]
    [ db_name . ] package_name
    [ package_characteristic ... ]
    [ package_specification_element ... ]
END [ package_name ]

… which is wrong, adding [ package_name ] after END will just cause an error.

And later

package_specification_function:
    func_name [ ( func_param [, func_param]... ) ]
    RETURN func_return_type
    [ package_routine_characteristic... ]

… which is wrong, it should be RETURNS not RETURN.

Also, since CREATE FUNCTION documentation says “RETURNS type” not “RETURNS func_return_type”, there’s no need to introduce a new term here.

As for CREATE PACKAGE BODY the default mode BNF is undocumented, only Oracle mode BNF is documented. So my description above might be missing some detail, for example maybe it’s possible somehow to declare package-wide cursors and handlers as well as variables.

Error messages

I see two package-related error messages in sql/share/errmsg-utf8.txt

"Subroutine '%-.192s' is declared in the package specification but is not defined in the package body"

and

"Subroutine '%-.192s' has a forward declaration but is not defined"

… which is wrong, there is no such thing as a subroutine, the term is “routine”. (Oracle has a thing called “subprogram” but it too would be a wrong term.)

After I create a package named pkg6 with a procedure p1, if I say

DROP PROCEDURE pkg6.p1;

I get told “PROCEDURE pkg6.p1 does not exist”.
… which is wrong, pkg6.p1 does exist, I can CALL it. It would be better to re-use the message “The used command is not allowed with this MariaDB version”. (Yes, it’s a statement not a command, but I can’t ask for the moon.)

If I say

GRANT EXECUTE ON PACKAGE no_such_package TO PUBLIC;

I get told “FUNCTION or PROCEDURE no_such_package does not exist”
which is wrong, I’m trying to grant on a nonexistent package not a nonexistent routine.

Qualifiers

Suppose we have a package named pkg containing a procedure p1. “CALL p1();” is legal inside another routine in the same package, but outside the package we have to add a qualifier: “CALL pkg.p1();”.

Here is an example that shows why this is dangerous. (Delimiters added so mysql client understands.)

DROP DATABASE pkg;
DROP PACKAGE pkg;
CREATE DATABASE pkg;
CREATE PROCEDURE pkg.p1()
  SELECT 'database';
CALL pkg.p1();
DELIMITER $
CREATE PACKAGE pkg PROCEDURE p1(); END;
$
DELIMITER ;
DELIMITER $
CREATE PACKAGE BODY PKG PROCEDURE p1()
  SELECT 'package'; END;
$
DELIMITER ;
CALL pkg.p1();

…
The first “CALL pkg.p1();” will display “database”, the second “CALL pkg.p1();” will display “package”. The package has shadowed the database!

People can avoid the danger by adopting a naming convention that database names and package names will always have different prefixes, but they won’t.

Or people can “fully” qualify the package’s P1 by saying “CALL [database_name.].[package_name.].p1();”. But they cannot “fully” qualify the database’s P1 by saying “CALL [catalog_name.][database_name.]p1();” — you’ll see a CATALOG_NAME column in INFORMATION_SCHEMA tables, but it is useless.

Therefore MariaDB should emit a warning message when there’s ambiguity, or support a different qualifier syntax. I’m hopeful that will happen in some future version.

By the way, Mimer “solves” this by disallowing: “The module name is never used to qualify the name of a routine.” It’s unstated, but I suppose this would mean that no two procedures can have the same name in the same schema, even if they are in different packages of the schema.

Also the standard allows SCHEMA and PATH which might be another way to evade the ambiguity, but it’s not necessary.

Metadata

The obvious question after creation is: how can I see what’s in a package?

SHOW CREATE PACKAGE works. SHOW CREATE PACKAGE BODY works.
SHOW PACKAGE STATUS works. SHOW PACKAGE BODY STATUS works.
But they’re SHOW statements and therefore they’re no good.

In INFORMATION_SCHEMA.ROUTINES the package will appear with routine_type = ‘PACKAGE’ and routine_definition = ‘procedure pkg(); end’.
This is odd because
(a) a package is not a routine
(b) there is no procedure named pkg
(c) the actual routine is not a row in information_schema!
I can dig the routine out of another row that has routine_type = ‘PACKAGE BODY’ but I can do it because I have an SQL parser available, other people would be stalled because the body is a mishmash of routines and contents.

Similar cluttering occurs for mysql.proc, although at least there I see PROCEDURE and FUNCTION entries. Remember that the the ‘body’ field might be blank unless you have appropriate privileges.

The obvious answer, similar to what the standard has, is: put routines in INFORMATION_SCHEMA.ROUTINES, and add a PACKAGE_NAME column. Probably something needs to be added to mysql.proc too. Until that happens, since SHOW is not useful, getting metadata for package routines is awkward.

The answer hasn’t appeared in code yet but I’ll assume that what’s obvious will happen.

Variables

I can declare variables that are accessible from all routines in the package. This is possible in CREATE PACKAGE BODY and alas might soon be in CREATE PACKAGE too, if this is done.

Here is an illustration.

DELIMITER $
CREATE OR REPLACE PACKAGE BODY pkg
  -- variable declarations
  DECLARE a INT DEFAULT 11;
  DECLARE b INT DEFAULT 10;
  FUNCTION f1() RETURNS INT
  BEGIN
    SET a=a-1;
    RETURN a;
  END;
  -- routine declarations
  PROCEDURE p1()
  BEGIN
    SELECT a,f1(),a;
  END;
  -- package initialization section
  SET a=a-b;
END;
$
DELIMITER ;

And the question is: what should “CALL pkg.p1();” display?

If you guessed 1, 0, 0 then good for you, but notice what’s unpleasant here. First: we have a procedure’s variable’s value being changed in a way that the procedure doesn’t see. Second: the value changes between the first time it’s selected and the second time it’s selected, in the same statement.

Now, This won’t startle any experienced person, since MariaDB user variables (the ones whose names start with ‘@’) have always worked that way. But I can’t think of any case where that can happen with a DECLAREd variable, so it might startle people who have only worked with standard-like syntax.

I like globals, but I am just expecting that some people will consider it should be noted in a style guide. One of the suggestions I’ve seen (for Oracle) is that package variables are a way to do “constants”. I must emphasize, though, that I’m only talking about what some people might like in style guides, and I’m recognizing that many more people will see an advantage to sharing dynamic variables.

Private routines

Suppose I say

CREATE PACKAGE pkg12 PROCEDURE p1(); END;
CREATE PACKAGE BODY pkg12
  PROCEDURE p0() SELECT 5;
  PROCEDURE p1() CALL p0();
END;
CALL pkg12.p1() /* This succeeds */;
CALL pkg12.p0() /* This fails */;

Thus p0 is not in CREATE PACKAGE but p0 is in CREATE PACKAGE BODY. That is legal provided p0 comes before p1 (no forward references please). In this case p1 is a “public” routine — I can CALL pkg12.p1() from outside the package. However, p0 is a “private” routine — I cannot CALL pkg12.p0() from outside the package. I will see “Error 1305 (42000) PROCEDURE pkg12.p0 does not exist”.

Nothing against private, but since pkg12.p0 does exist, I think a message that’s more explicit would help somebody in ages to come. Otherwise, it should be made obvious. Probably a naming convention would be a good way to do that. A comment would not be a good way because many clients, including mysql and ocelotgui, have –skip-comments as a default.

Privileges

To allow CREATE PACKAGE (example);

GRANT CREATE ROUTINE ON w2.* TO k@localhost;

To allow EXECUTE of a package (example):

GRANT EXECUTE ON PACKAGE w2.pkg TO k@localhost;

This is a good thing, the usual privileges affecting routines will affect packages, as a whole. It’s a bit odd that a qualifier is necessary for GRANT but not for CALL; however.

To allow SHOW CREATE PACKAGE (example):

GRANT EXECUTE ON PACKAGE w2.pkg TO k@localhost;
GRANT ALTER ROUTINE ON PACKAGE w2.pkg TO k@localhost;

This is a strange thing, currently one way to make SHOW CREATE possible is to grant ALTER ROUTINE.

ALTER

MariaDB has eleven ALTER statements, but ALTER PACKAGE is not one of them. Given that Oracle has one, and DB2 has ALTER MODULE, and it’s mentioned in a MariaDB document, I expect this will eventually be added with an excuse of “orthogonality”.

Debugger

The debugger in the Ocelot GUI does not yet work with routines inside packages. However, in a version which will be released soon, the “recognizer” will see MariaDB 11.4 syntax and be able to alert typists about what syntax is expected as they type, the same experience that they get for other statements.

This enhancement is already in the source code, in this patch.

Update

2024-02-15: MariaDB has done some fixes for the documentation matters that I mentioned, and there are plans for others in their bugs database, including: MDEV-33382 Documentation fixes for CREATE PACKAGE and CREATE PACKAGE BODY, MDEV-33384 ALTER PACKAGE [ BODY ] statement, MDEV-33385 Support package routines in metadata view, MDEV-33386 Wrong error message on `GRANT .. ON PACKAGE no_such_package ..`, MDEV-33395 I_S views to list stored packages, MDEV-33399 Package variables return a wrong result when changed inside a function, MDEV-33403 Document stored packages overview, MDEV-33428 Error messages ER_PACKAGE_ROUTINE_* are not good enough. For further developments follow MariaDB’s announcements. Meanwhile ocelotgui 2.2 has appeared which recognizes the new syntax.

pgulutzan, January 31, 2024. Category: MariaDB, Uncategorized.

Version 2.1

Ocelotgui, the Ocelot Graphical User Interface, version 2.1, is now available for download. The principal new feature is the chart modifications described in the last post, Charts. Packages are for Linux or Windows, compatible DBMSs are MySQL or MariaDB or Tarantool. Get description and instructions and clone source from GitHub Code area, or get the release from GitHub Releases area.

pgulutzan, September 29, 2023. Category: MySQL / MariaDB, Uncategorized.

Draw Database Diagrams

It’s pleasant to admire drawings of database tables and their foreign-key relationships, which are “data structure diagrams” or more grandiosely “entity relationship (ER) diagrams”. I’ll show what the esthetic considerations are i.e. “does this look nice”, and what the engineering considerations are i.e. “what do I have to do to make it at all”. I’ll begin with the conclusion — here is a drawing of the well-known sakila database.

To find diagrams that other products draw for the same database, just google “sakila er diagram”. You’ll see something like this:

and what I’m trying to address first is: who’s the fairest?

Esthetics

Esthetics, also spelled aesthetics, is “the formal study of the principles of art and beauty”. And there really have been formal studies where academics look for criteria that people judge are important. A fine description is in a Radboud University thesis by Bart van de Put, “Scoring Entity-Relationship Diagrams Drawn by a Computer Algorithm”. I will consider five criteria that the thesis mentions:

Bends when a line makes a sharp direction change.

Crossings when two lines intersect or nearly intersect.

Edge non-orthogonality when lines are not all horizontal or vertical with right angles.

Node non-orthogonality when tables are not in a fixed rank-and-file grid.

Inconsistent line lengths when some of a table’s relationships are short and others are long.

Although a computer program may not be great as a beauty contest judge, it can simply count bends and crossings, assign numeric weights to them, and come up with a negative score. Then picking for display a drawing with the best score can be achieved with brute-force comparisons or with functions that are slightly more sophisticated but enormously harder to write.

Bends

Unlike what happens in most of the google-result diagrams I mentioned above, I just declare “there are no bends”. All lines are straight, with many more angles than 90-degree up/down/left/right.

For example this is a table that had foreign-key relationships with 8 other tables. Bend count: zero.

Let’s admit the downsides too:
(a) The original idea of structure diagrams was that lines could be done with as few as four ASCII characters – | > < and so were easy to reproduce. Drawing angles is perhaps technically possible with ASCII art but would need more screen space.
(b) In the situations where the classic style would bend around, the straight-line style will draw right through a table. This can be illustrated with a sakila diagram that has all the columns, and the line between rental and inventory goes right over the staff table.

There’s an option to put the line in the background instead

but either way, there’s an esthetic problem which should be weighted the same as a count of bends, or higher.

(c) we can’t put labels on any non-horizontal lines, so the only way to know the foreign-key constraint is to hover over the line and wait for a tooltip to appear.

Crossings

Unlike the monochrome google-result diagrams I mentioned above, I just declare “different lines have different colours”. This has limited possibilities since the palette has to consist of only colours that are easily distinguishable and are not similar to the background colour, which would have a camouflage effect. So the practical limit is about a dozen colours. However, usually, when two lines are close or intersecting it’s easy to distinguish them.

Let’s admit the downside here too:
(a) this doesn’t reduce the number of crossings, it just patches over the problem
(b) it’s a matter of chance whether a bunch of colours is esthetic or not.

Edge non-orthogonality

The algorithm produces straight lines and almost no right angles so it is almost always edge non-orthogonal. As the Radboud University thesis says, edge non-orthogonality is considered a negative. But the thesis also admits that there are unnecessary bends otherwise. I’m giving bends more weight. I see in other products that edge orthogonality has more weight.

Node non-orthogonality

The algorithm places high weight on node orthogonality and one result is that tables are in fixed ranks and files, always. Again, this is unlike what I see in other products. I suppose that their criterion is compactness: if there are fewer spaces between tables, then there is less scrolling necessary to see the whole picture. So this is another case of a beautification with a downside, but I insist on it because of another consideration: specifiability.

Users will sometimes want to specify, instead of “ERDIAGRAM OF whole_schema”, “ERDIAGRAM OF schema TABLES (table-list)”. Once we allow that much, it’s reasonably simple to allow for users to say which tables should go in which fixed positions. For example, consider this specification
SHOW ERDIAGRAM OF sakila TABLES (staff 0 1, inventory 1 0, rental 2 1)
which results in this

Users can figure out graphic coordinates such as 0 1 “position 0 on the X axis position 1 on the Y axis”. They would be , however, if the coordinates had to be pixel positions instead. Having straight ranks and files makes it possible to specify with small integers.

By the way, the automatic production of drawings based on databases is sometimes called “reverse engineering”. So I suppose the appropriate term for specifying a diagram like this is “forward engineering”, but the real engineering job is of course the CREATE TABLE statements.

Positioning

Suppose we have three tables A B C, and A references C, well then we wouldn’t want to draw B between A and C, we’d want the diagram to look like A C B. In other words, if we could just get related tables to be drawn close to each other with no intervening tables, we’d have no bends. For a tiny number of tables this is easy to see and to fix. But as the number of tables grows the possible number of possible table positions in the drawing grows geometrically, like a travelling salesman problem. So positioning is far more difficult than bend denialism or crossings anti-camouflage.

The Radboud University thesis mentions a tried-and-mathematically-solid way to get the fewest bends with a large number of related tables. I saw it too late, though, so the method I implemented might be inferior. Also I might have saved time by bringing in the Open Graph Drawing Algorithm library, but didn’t.

The method is like crystal formation.

Start by drawing the table A that has the most relationships, either referencing-from or referenced-to. Then, for each table that A is related to, draw nearby (as closely as possible without interfering with other tables) the related tables … starting with the table B which is related to A which has the most relationships, and so on until there are no more. This involves a lot of recursion.

The initial result of a crystal formation is sometimes pretty good, and then there are last-microsecond patchups that can enhance it. Look for pairs of tables that are not near each other and see whether, if they were exchanged with each other or if one of them was moved close to the other, the total bend count for the table would go down. This usually helps just a little bit and adds more than a little bit to the computer time, so it can only go on for a limited number of looks.

Touchup

So much for the esthetics that are specifically for ER diagrams. I suppose I should add that there have to be further esthetic features that apply to widgets in general. The diagram must be changeable with respect to fonts, header and detail background colours, and line widths.

For example, for this diagram of a limited number of tables in sakila I specified that column names ending with _id should be in bold, the lines should be extra wide, and header background should be yellow.

(One thing users can’t change is the way that an arrow looks, the so-called Crow’s foot. At least one other product offers choices for this.)

I imagine that other high-class GUI clients can do such touchups so this is nothing special, but in a post about esthetics I figure I ought to mention this final step.

But where’s the SQL?

In this case the SQL statements are the easy part, especially if the DBMS supports SQL-standard INFORMATION_SCHEMA access.

All the necessary information about foreign-key relationships can be picked up with
SELECT … FROM INFORMATION_SCHEMA.KEY_COLUMN_USAGE WHERE …
There’s a quirk that is non-standard: your favourite DBMS insists that primary-key constraints must be named PRIMARY, so KEY_COLUMN_USAGE has all we need for deciding which columns are part of the a primary key, as well as which columns are part of a foreign key. And if the user specifies that the drawing should have all columns instead of just primary-key columns,
SELECT … FROM INFORMATION_SCHEMA.COLUMNS WHERE …
does the job.

However, I don’t think it’s a good idea for the program to do those SELECTS every time the user asks for a drawing. They’re expensive on the server. (Rule: taking time on a client computer is trivial, taking time on the server might not be.) So this is done as part of an “explorer”, which means the user can slowly slowly pick up all the table and column and relationship information at once, so it’s stored in the local computer’s memory, and subsequent requests to draw don’t affect the server at all unless the user decides to refresh potentially-obsolete information.

ocelotgui

The feature I’ve just described comes with ocelotgui 1.9 which is available for download on github.

pgulutzan, March 9, 2023. Category: Uncategorized.

Foreign Key Displays

For foreign keys, I like to see what references what, It’s easy to find but the challenge is to decide how to display.

The cute

Showing a result with the mysql client:

prompt> call show_fk('t2');
+-------------+-----------------+------------+
| referencing | constraint_name | referenced |
+-------------+-----------------+------------+
| t2          | mx              | t3         |
| s1          |                 | s1         |
| s2          |                 | s2         |
|             |                 |            |
| t2          | t2_ibfk_1       | t1         |
| s1          |                 | s1         |
+-------------+-----------------+------------+
6 rows in set (0.08 sec)

Query OK, 6 rows affected (0.08 sec)

That is, for table T2, I show the table name and foreign-key name and referenced-table name, then the columns of that table in the order they appear in the foreign key, then if there is another foreign key repeat after a blank line.

Here is a long but simple stored procedure that produces such a display.

DROP PROCEDURE show_fk;
CREATE PROCEDURE show_fk(ref_table_name VARCHAR(128))
BEGIN
 DECLARE d_table_name VARCHAR(128);
 DECLARE d_column_name VARCHAR(128);
 DECLARE d_ordinal_position INT;
 DECLARE d_constraint_name VARCHAR(128);
 DECLARE d_referenced_table_name VARCHAR(128);
 DECLARE d_referenced_column_name VARCHAR(128);
 DECLARE counter INT DEFAULT 0;
 DECLARE err INT DEFAULT 0;
 DECLARE x CURSOR FOR
 SELECT table_name, column_name, ordinal_position,
 constraint_name,
 referenced_table_name, referenced_column_name
 FROM information_schema.key_column_usage
 WHERE table_name = ref_table_name AND referenced_column_name IS NOT NULL
 ORDER BY constraint_name, ordinal_position;
 DECLARE CONTINUE HANDLER FOR NOT FOUND SET err = 1;
 DECLARE CONTINUE HANDLER FOR SQLEXCEPTION set err = 1;
 CREATE TEMPORARY TABLE fks (referencing VARCHAR(128),
                             constraint_name VARCHAR(128),
                             referenced VARCHAR(128));
 OPEN x;
 WHILE err = 0 DO
   FETCH x INTO d_table_name,d_column_name, d_ordinal_position,
              d_constraint_name,
              d_referenced_table_name, d_referenced_column_name;
   IF err = 0 THEN
     IF counter <> 0 AND d_ordinal_position = 1 THEN
       INSERT INTO fks VALUES ('', '', '');
     END IF;
     IF d_ordinal_position = 1 THEN
       INSERT INTO fks VALUES (d_table_name, d_constraint_name, d_referenced_table_name);
     END IF;
     INSERT INTO fks VALUES (d_column_name, '', d_referenced_column_name);
     SET counter = counter + 1;
   END IF;
 END WHILE;
 CLOSE x;
 SELECT * FROM fks;
 DROP TABLE fks;
END;
CALL show_fk('t2');

This is vaguely like an entity-relationship diagram, but with tables rather than pictures.

The flaws are: (1) it needs extra privileges, (2) it mixes different object types. So let’s look at the simpler and more common type of display.

The usual

Everything necessary can come from information_schema.key_column_usage.

For example, there’s a GUI that displays with these columns:

That’s easy to reproduce with

SELECT constraint_name AS `Name`,
 table_schema AS `Schema`,
 table_name AS `Table`,
 column_name AS `Column`,
 referenced_table_schema AS `Referenced Schema`,
 referenced_table_name AS `Referenced Table`,
 referenced_column_name AS `Referenced Column`
FROM information_schema.key_column_usage
WHERE referenced_column_name IS NOT NULL
ORDER BY constraint_name, ordinal_position;

Or if that takes too long to type, make a view.

There’s another GUI that displays with these columns:

(The “On UPDATE” and “On DELETE” values would have to come from information_schema.referential_constraints.)

The objection that I’d make is that such headers are not standard. So anybody who knows the actual column names has to do a double take, wondering whether the first column is the same as “constraint_name” or something exotic, and so on. Use of multiple different names for the same thing is poetry not programming.

The new

So I think this display, which admittedly makes cosmetic changes (replacing ‘_’ with ‘ ‘ and changing upper case to mixed case and emphasizing one column) is better:

I’ve made it so that can come from user statements or from the explorer. The source code is downloadable now and the released executables will come soon.

pgulutzan, January 1, 2023. Category: Standard SQL, Uncategorized.

Offset and fetch clauses in MariaDB

A new feature in MariaDB 10.6 is “Implement SQL-standard SELECT … OFFSET … FETCH”.

I’ll say what’s not in the manual: mention other vendors, add details, talk style, show how to get the wrong answer, describe WITH TIES, mention alternatives, experiment with locks, and list 11 minor flaws.

Other vendors

Products with support for OFFSET … FETCH:
Oracle, DB2, SQL Server .

Products with support for LIMIT … OFFSET:
Tarantool, MySQL, Vertica, PostgreSQL.

Products with support for neither:
Firebird, ScyllaDB, Actian.

This isn’t a complete list — there are lots of SQL products nowadays — but it does include the “Big Three”. All of them support SELECT … OFFSET … FETCH.

Syntax details

The BNF actually is:

[ OFFSET start { ROW | ROWS } ]
[ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES } ]

The “start” and count” must be unsigned integers if they’re literals, but if they’re declared variables then they can be NULL (in which case they’re treated as zeros), or can be negative (in which case they’re treated as -infinity for “start” and +infinity for “count”) (but depending on undefined behaviour isn’t always fun).

The lowest offset is zero which is the first row, and the lowest count is zero, that is, it is possible to select zero rows. If count is omitted, then 1 is assumed.

Style

If you haven’t read the descriptive SQL style guide and clicked like, then you won’t understand this section, skip it.

OFFSET and FETCH are both either “clause starts” or “sub-clause starts” so if the choice is to put separate clauses on separate lines, then they are.

Deciding whether to use ROW rather than ROWS, or FIRST rather than NEXT, (neither of which make a difference for server behaviour), one could base a rule on English grammar:
(1) If there was an OFFSET clause with a non-zero number-of-rows, then NEXT, else FIRST. If number-of-rows is greater than one, or unknown, then ROWS, else ROW.
or one could base a rule on what comes first in the MariaDB manual BNF:
(2) The BNF has {FIRST|NEXT} and {ROW|ROWS} so FIRST and ROW.
or one could base a rule on what the examples in the manual use most often:
(3) FIRST and ROWS
or one could base a rule on whatever is shortest:
(4) NEXT and ROW.

I like (3), for example

  OFFSET 5 ROWS
  FETCH FIRST 1 ROWS ONLY;

but there’s no established rule yet so I’ll vary the format in what follows.

How to get the wrong answer

Try this:

CREATE TABLE td (char_1_column CHAR(1) DEFAULT '1',
                 null_column INT,
                 char_2_column CHAR(2));
CREATE INDEX int_char_1_column_plus_char_2_column
  ON td (char_1_column, char_2_column);
INSERT INTO td (char_2_column)
  VALUES ('a'),('b'),('c'),('d'),('e'),('ab');
SELECT char_2_column, null_column
  FROM td
  WHERE char_1_column > '0'
  OFFSET 0 ROWS
  FETCH FIRST 4 ROWS ONLY;
SELECT char_2_column, null_column
  FROM td WHERE char_1_column > '0'
  OFFSET 4 ROWS
  FETCH FIRST 4 ROWS ONLY;

Result of first SELECT:
(‘a’,NULL),(‘ab’,NULL),(‘b’,NULL),(‘c’,NULL)

Result of second SELECT:
(‘e’,NULL),(‘ab’,NULL)

So there’s no (‘d’,NULL) and there’s a duplicate (‘ab’,NULL). Wrong answer.

I don’t think it’s a bug because, in the absence of an ORDER BY statement, there is no guarantee which rows will be skipped in the second SELECT. But luckily it’s due to an exotic situation which is easy to avoid: the combined index on a non-unique column and a multi-character string column. Just don’t use the index, and all is well.

WITH TIES

The effects of the new syntax are mostly the same as with LIMIT … OFFSET, but WITH TIES is new.

Suppose there are 6 rows, selected in order (there must be an ORDER BY clause):
(NULL),(NULL),(1),(1),(2),(2).
Then OFFSET 0 ROWS FETCH NEXT 1 ROWS WITH TIES will gives us (NULL),(NULL).
Thus WITH TIES doesn’t mean that the two values are technically “equal”, they merely must be peers, so NULLS tie. The interesting thing is that when you ask for “NEXT 1 ROW” you get the next 2 rows. But that only applies to FETCH, it doesn’t apply to OFFSET, therefore this could be a mistake:

SELECT * FROM t
  ORDER BY pk
  OFFSET 0 ROWS
  FETCH NEXT 1 ROWS WITH TIES;
SELECT * FROM t
  ORDER BY pk
  OFFSET 1 ROWS
  FETCH NEXT 1 ROWS WITH TIES;

With the example 6 rows, the second SELECT will return a row that was already returned for the first SELECT. It would be better to say, instead of “OFFSET 1 ROWS”, “OFFSET (number of rows returned by the previous SELECT) ROWS”.

Alternatives

A standard way to fetch only one row:

BEGIN NOT ATOMIC
  DECLARE v INT;
  DECLARE c CURSOR FOR SELECT * FROM t;
  OPEN c;
  FETCH NEXT FROM c INTO v;
END;

A non-standard way to fetch only one row: (remember to restore @@sql_select_limit later):

SET sql_select_limit = 1;
SELECT * FROM t;

A way to offset by 10 and fetch 10 (this assumes that pk has all the integers in this range i.e. it depends on user knowing about contents and indexes, and assumes that user only wants to go forward without skipping)

SELECT * FROM t
WHERE pk > 10 AND pk <= 10 + 10;

A way to let the user do the offsets and fetches (this depends on the client being a GUI and the user knowing how to click on a scroll bar)

SELECT * FROM t;

… although users often only look at the first few pages (think of how often you use Google and only click Next a few dozen times).

Locking

Will locking, or SET TRANSACTION ISOLATION LEVEL, guarantee consistency?

For a test, I started two clients. On the first, I said:

CREATE TABLE pks (pk INT PRIMARY KEY);
BEGIN NOT ATOMIC
  DECLARE v INT DEFAULT 0;
  WHILE v <= 10000 DO
    INSERT INTO pks VALUES (v);
    SET v = v + 1;
  END WHILE;
END;
SET @@autocommit=0;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
SELECT * FROM pks
  WHERE pk > 400
  OFFSET 10 ROWS
  FETCH FIRST 2 ROWS ONLY FOR UPDATE;

On the second, I said:

SET @@autocommit=0;
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
START TRANSACTION;
DELETE FROM pks WHERE pk = 7227;
DELETE FROM pks WHERE pk = 405;
DELETE FROM pks WHERE pk = 411;

Result: The first DELETE succeeds, the second DELETE hangs until I say ROLLBACK; on the first client. The third DELETE would also have hung if I had said it first.

In other words: the rows between 401 and 410 (the rows skipped due to OFFSET) are locked, and the rows between 411 and 412 (the rows picked up by FETCH) are locked, but the rows after 412 (the rest of the rows matching WHERE) are not locked.

What this means, of course, is that ordinarily you can’t be certain that you’re paging through a consistent set. If this is a concern (it shouldn’t always be), make a copy on either the server or the client.

Minor Flaws

A few incompatibilities with standard SQL or another DBMS or another bit of MariaDB 10.10 syntax.

Table t has 4 rows created thus:

CREATE TABLE t (s1 INT, s2 INT);
INSERT INTO t VALUES (1,2),(2,1),(3,2),(4,1);

WITH x AS (SELECT * FROM t OFFSET 5 ROWS) SELECT * FROM x;

Result: Success, all 4 rows in table t are returned. It appears that the OFFSET clause is ignored.

BEGIN NOT ATOMIC
  DECLARE var1 DECIMAL(0);
  SELECT * FROM t OFFSET var1 ROWS;
END;

Result: Error 1691 (HY000) A variable of a non-integer based type in LIMIT clause
Never mind whether var1 should be accepted. The error message is obsolete, I was not using a LIMIT clause.

BEGIN NOT ATOMIC
  DECLARE var1 BIT DEFAULT NULL;
  SELECT * FROM t OFFSET var1 ROWS;
END;

Result: Success, all rows are returned.
Sometimes non-integer variables are illegal, but here a BIT variable is legal.
MariaDB calls BIT a numeric data type but SELECT * FROM t OFFSET b’1′ ROWS wouldn’t work.

SET sql_mode='oracle';
CREATE PROCEDURE p2 AS
  var1 NUMBER := 1;
  BEGIN
    SELECT * FROM t OFFSET var1 rows;
  END;

(Don’t forget to restore sql_mode later.)
Result: Error 1691 (HY000) A variable of a non-integer based type in LIMIT clause
That’s right for standard SQL (“The declared type … shall be an exact numeric with scale 0”) but Oracle says it might be more forgiving (“If rowcount includes a fraction, then the fractional portion is truncated.”)

SELECT * FROM t
  OFFSET 1 ROWS
  FETCH FIRST 0 ROWS ONLY;

Result: Success, zero rows are returned.
That’s wrong for standard SQL, which would return “data exception — invalid row count in fetch first clause”.

SELECT * FROM t
  FETCH FIRST 1 ROWS ONLY;

Result: Success.
But the manual’s BNF is

OFFSET start { ROW | ROWS }
FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }

which might suggest to newbies that the OFFSET clause is compulsory before FETCH.

SET @x = 1;
SELECT * FROM t OFFSET @x ROWS;

Result: Syntax error.
I can find, by saying CREATE TABLE .. AS SELECT @x; and then saying SHOW CREATE TABLE or SELECT … FROM INFORMATION_SCHEMA, that @x is BIGINT. BIGINT declared variables are legal, and I can use variables that start with @ in some other situations. This is how things go when variables have variable types.

SHOW WARNINGS OFFSET 1 ROW;

Result: Success. I believe this is reasonable, and of course any non-standard syntax can be handled in an implementor-defined way. But it’s not documented on the documentation page for SHOW WARNINGS.

BEGIN NOT ATOMIC
  DECLARE only INT DEFAULT 2e2;
  DECLARE row INT DEFAULT (SELECT 5);
  SELECT * FROM t OFFSET ROW ROW FETCH FIRST ONLY ROW ONLY;
END;

Result: Success.
In standard SQL:2016 ROW and ONLY are reserved words.
MariaDB tends to accept goofiness since it’s not your nanny, but I believe in egregious cases the standard strictness is preferable after reasonable notice.

10.

SELECT * FROM t
  FETCH FIRST 1 ROWS ONLY ROWS EXAMINED 1;

Result: Error.
Another inconsistency, since LIMIT 1 ROWS EXAMINED 1 is legal.

11.

VALUES (1),(2),(3) OFFSET 1 ROW FETCH FIRST ROW ONLY;

Result: Success.
This is fine, but once again undocumented.

ocelotgui support

The Ocelot GUI client recognizes MariaDB syntax when connected to a MariaDB server, so if you need features like autocomplete after OFFSET to be correct then download it from github.

pgulutzan, December 13, 2022. Category: MariaDB, Standard SQL, Uncategorized.

The SQL substring function

I’ll cover SUBSTRING’s variants, its behaviour in odd cases, and what can go wrong.
I’ll add a C program that emulates what the SQL:2016 standard requires.

Ordinary case

SUBSTRING(value-expression FROM start-position [FOR string-length])

I expect that everyone knows the ordinary case: value-expression should be either a character string or a binary string, start-position and string-length should be integers, and SUBSTRING(‘abc’ FROM 2 FOR 2) will return ‘bc’.

Variation: SUBSTR and/or commas

SUBSTR(value-expression, start-position [, string-length])

Abbreviating SUBSTRING to SUBSTR, with commas instead of words to separate arguments, is Oracle and SQLite syntax. MySQL cheerily accepts both syntaxes and treats them as synonyms.

Informix accepts both but does not treat them as synonyms. DB2 has something similar.

Tarantool has SUBSTR() but will soon change to SUBSTRING().

So when you have a choice, you just have to decide: do you want to be compatible with Oracle, or with almost everybody else?

Variation: passing a non-string and non-integers

In MySQL/MariaDB I can say SUBSTRING(123, 1.5, 1.1) and get ‘2’. The standard requirement is: the first argument must be a string and the others must be exact with scale 0. So rather than accepting decimals and rounding, others will call this an error.

What we’re seeing here is the typical MySQL/MariaDB idea that, if it’s possible to do something with the input, do it. And in this case it’s possible by converting the first argument to a string and rounding the next arguments.

But it’s not a strictly MySQL thing, Oracle and DB2 will also accept non-integers and do implicit conversions.

Variation: negative start-position

This is probably the most frequent variation.

What is SUBSTRING(‘abc’, -2, 2)?

The standard is 100% clear (well, as clear as it ever gets): start-position is 2 characters before the start of the string, and string start is 1, so
SUBSTRING(‘abc’ FROM -2 FOR 2) is a zero-length string but
SUBSTRING(‘abc’ FROM -2 FOR 4) is ‘a’.
SQL Server follows this though I don’t know whether it always does so.

However — perhaps because programmers are used to seeing such stuff in php and perl and Lua — the alternative is to say that negative start-positions are counted backwards from the end of the string, so -1 is the last character, -2 is the second-last, and so on.

So in MySQL SUBSTRING(‘abc’ FROM -2 FOR 2) is ‘bc’ and
SUBSTRING(‘abc’ FROM -2 FOR 4) is also ‘bc’.
But don’t hope for consistency: SUBSTRING(‘abc’ FROM -4 FOR 5) is just a zero-length string.

I’ll admit that the count-backwards variation is popular, since it’s in Oracle and SQLite and MySQL/MariaDB (and in Informix SUBSTR, which is the difference I mentioned earlier). But it’s awfully easy to avoid: use a different function if your DBMS supports RIGHT() or INSTR(), or say
SUBSTRING(‘abc’ FROM LENGTH(‘abc’) – 2 FOR 2).

For that last suggestion, I wondered: won’t that take longer? So I filled a table with long strings.

CREATE TABLE j (s1 TEXT(1000000));
CREATE PROCEDURE p()
BEGIN
  DECLARE i INTEGER DEFAULT 0;
  WHILE i < 737 DO
    INSERT INTO j VALUES (REPEAT('abc',1000000));
    SET i = i + 1;
  END WHILE;
END;
CALL p();
SELECT SUBSTRING(s1 FROM -2 FOR 2) FROM j;
SELECT SUBSTRING(s1 FROM LENGTH(s1)-2 FOR 2) FROM j;

The SELECT with “FROM -2” takes on average about 12 seconds on my laptop.
The SELECT with “FROM LENGTH(s1)-2” takes on average about 7 seconds.
Your mileage will vary but if you get anything vaguely similar you’ll have to conclude that non-standard negative start positions are unnecessary.

Variation: going past the end

Of course SUBSTRING(‘abc’ FROM 3 FOR 5) is ‘c’ and the nonexistent characters past the end don’t matter. That’s standard, and it’s why I found it inconsistent when I saw different behaviour for nonexistent characters before the start.

But there’s another possible answer: the DB2 incomprehensible result. The documentation for SUBSTR() says that the result is an error, and simultaneously that the string is padded on the right with spaces or zero bytes.

Variation: CHAR in, VARCHAR out

If the input is CHAR, is the result supposed to be CHAR?

The standard is once again clear about this:
“If the declared type of <character value expression> is fixed-length character string or variable-length character string, then DTCSF is a variable-length character string type with maximum length equal to the fixed length or maximum length of DTCVE.”

For Oracle the result is VARCHAR if the input is CHAR.

For SQL Server the result is VARCHAR if the input is CHAR.

For DB2 the result is VARCHAR if the input is CHAR.

For MySQL the result is undocumented so I used my usual way of finding out what the data type of a result is …

CREATE TABLE j (s1 CHAR(5));
CREATE TABLE j2 AS SELECT SUBSTRING(s1 FROM 2 FOR 2) FROM j;
SELECT table_name, data_type FROM information_Schema.columns WHERE column_name = 's1';

Result: ‘char’. Yes, I know, SHOW CREATE TABLE j2; would say ‘varchar’, but I don’t rely on it here. If you set sql_mode=’pad_char_to_full_length’; and then select length(column_name) from all rows in the table, you’ll see it’s always the defined length.

Variation: octets

SUBSTRING(… FROM … FOR … [USING CHARACTERS|OCTETS])

By now you’ve gathered that my main concern is SUBSTRING for characters more than SUBSTRING for binaries, but the standard says I can specify the position and length in bytes rather than characters.

So, if you’ve got a multi-byte character, and you use OCTETS, you can split it so that the result is not valid characters. For example: starting with ‘Д’ in UTF-8 (which is encoded as 0D94):
SUBSTRING(‘Д’ FROM 2 FOR 1 USING OCTETS)
is X’94’ which is not a UTF-8 character that was in the input.

DB2 covers this up by replacing invalid character fragments with spaces.

For the standard, for the newer substring function added in SQL:2016, I read:
“If [character length units] is OCTETS and the [relevant] octet of STR is not the first octet of a character, then the result of [this function] is implementation-dependent.”
That certainly seems like the appropriate thing to say about all substring functions.

Variation: added in SQL:2016

These SUBSTRING syntaxes which are part of SQL:2016 might someday be interesting:

<regular expression substring function> ::=
SUBSTRING <left paren> <character value expression> SIMILAR <character value expression>
      ESCAPE <escape character> <right paren>

<regex substring function> ::=
  SUBSTRING_REGEX <left paren>
      <XQuery pattern> [ FLAG <XQuery option flag> ]
      IN <regex subject string>
      [ FROM <start position> ]
      [ USING <char length units> ]
      [ OCCURRENCE <regex occurrence> ]
      [ GROUP <regex capture group> ]
      <right paren>

… but as of today they seem to have not caught fire in popular imagination.

Variation: zero-length strings can be errors

The obsolete manual for Oracle 10g says “When you specify a value that is less than 1, the function returns NA.” which I guess means it was an error long ago. Now the return is null because Oracle doesn’t distinguish it from a zero-length string.

In a draft of an earlier SQL standard, one might read:
“16) If the result of <string value expression> is a zero-length character string, then it is implementation-defined whether an exception condition is raised: data exception — zero-length character string”

But that, in the section about <string value function>, didn’t fit well. It was replaced by
“16) If the result of <string value function> is the zero-length character string or the zero-length binary string, then it is implementation-defined whether an exception condition is raised: data exception – zero-length character string or data exception – zero-length binary string, respectively.”

So: if the result of SUBSTRING() is ”, then it’s not a violation of the standard to return an error. However, I don’t expect that anybody would say that unless they did the same for everything, not just SUBSTRING().

C program that does what the standard says

I said the standard document is clear, but it is also intimidating. The main clause about SUBSTRING for characters looks like this:

3)If <character substring function> is specified, then:
  a) If the character encoding form of <character value expression> is UTF8, UTF16, or UTF32, then, in
     the remainder of this General Rule, the term "character" shall be taken to mean "unit specified by
     <char length units>".
  b) Let C be the value of the <character value expression>, let LC be the length in characters of C, and
     let S be the value of the <start position>.
  c) If <string length> is specified, then let L be the value of <string length> and let E be S+L. Otherwise,
     let E be the larger of LC + 1 and S.
  d) If at least one of C, S, and L is the null value, then the result of the <character substring function> is
     the null value.
  e) If E is less than S, then an exception condition is raised: data exception - substring error.
  f) Case:
      i)  If S is greater than LC or if E is less than 1 (one), then the result of the <character substring
          function> is the zero-length character string.
      ii) Otherwise,
          1) Let S1 be the larger of S and 1 (one). Let E1 be the smaller of E and LC+1. Let L1 be
             E1-S1.
          2) The result of the <character substring function> is a character string containing the L1
             characters of C starting at character number S1 in the same order that the characters appear
             in C.

It’s not something that humans can grasp immediately. But notice that, in English, it’s declaring variables and conditional-execution statements. So for testing purposes it’s straightforward to convert the English into C. That’s what I’ve done for this section. Here is a stand-alone program that replicates all the important parts of the clause so that anyone can see what SUBSTRING() would produce for any combination of
SUBSTRING(value-expression FROM start-position [FOR string-length])

/*
  Simulate ISO 9075-2:2016 description of substring
  By Peter Gulutzan 2021-08-30
  Copyright (c) 2021 by Peter Gulutzan. All rights reserved.
  To compile: gcc -o substring substring.c
  To run, with 2 or 3 args: substring character_value_expression start_position [string_length]
  Assumption: start_position and string_length are integers, there is no validity check.
  Convention: if character_value_expression is - then we treat it as an empty string.

  Example:
  pgulutzan@pgulutzan-VirtualBox:~/tarantool_sandbox$ ./substring 'abc' -1 1
  character_value_expression: abc.
  start_position: -1
  string_length: 1
  LC: 3
  S: -1
  L: 1
  E: 0
  S > LC or E < 1. Return zero-length string
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main( int argc, char *argv[] )
{
  if ((argc < 3) || (argc > 4))
  {
    printf("We want: character_value_expression start_position string_length\n");
    exit(0);
  }
  bool is_length_specified;

  char *character_value_expression= argv[1];

  int start_position= atoi(argv[2]); /* no validity check */

  int string_length;
  if (argc == 3)
  {
    is_length_specified= false;
  }
  else
  {
    is_length_specified= true;
    string_length= atoi(argv[3]); /* no validity check */
  }

  printf("character_value_expression: %s.\n", character_value_expression);
  printf("start_position: %d\n", start_position);
  if (is_length_specified == false)
  {
    printf("string_length: not specified\n");
  }
  else
  {
    printf("string_length: %d\n", string_length);
  }

  char *C= character_value_expression;
  int LC= -99, S= -99, E= -99, L= -99, S1= -99, E1= -99, L1= -99;

  if (strcmp(character_value_expression, "-") == 0) LC= 0;
  else LC= strlen(C);

  printf("LC: %d\n", LC);

  S= start_position;
  /* c)   If <string length> is specified, then let L be the value of <string length> and let E be S+L. Otherwise,
          let E be the larger of LC + 1 and S. */
  if (is_length_specified == true)
  {
    L = string_length;
    E= S + L;
  }
  else
  {
    if ((LC + 1) > S) E= LC + 1;
    else E= S;
  }

  printf("S: %d\n", S);
  printf("L: %d\n", L);
  printf("E: %d\n", E);

  /* e) If E is less than S, then an exception condition is raised: data exception -- substring error. */
  if (E < S)
  {
    printf("E < S. data exception -- substring error.\n");
    exit(0);
  }
  /* f) i) If S is greater than LC or if E is less than 1 (one), then the result of the <character substring
           function> is the zero-length character string. */
  if ((S > LC) || (E < 1))
  {
    printf("S > LC or E < 1. Return zero-length string\n");
    exit(0);
  }

  /* f) ii) 1) Let S1 be the larger of S and 1 (one). Let E1 be the smaller of E and LC+1. Let L1 be
               E1-S1. */
  if (S > 1) S1= S;
  else S1= 1;
  if (E < (LC + 1)) E1= E;
  else E1= LC + 1;
  L1= E1 - S1;

  printf("S1: %d\n", S1);
  printf("E1: %d\n", E1);
  printf("L1: %d\n", L1);

  /* f) ii) 2 The result of the <character substring function> is a character string containing the L1
              characters of C starting at character number S1 in the same order that the characters appear
              in C. */
  for (int i= S1 - 1; i < (S1 - 1) + L1; ++i)
  {
    printf("%c\n", *(C + i));
  }
}

ocelotgui progress

The open-source GUI client program is getting constantly better. The next big feature will be “Export” to a variety of formats including delimited, boxed, html, and default. The preliminary code has been uploaded to our github repository.

The README for the current version (1.5) is here for MySQL/MariaDB.
Or, here for Tarantool.

pgulutzan, September 2, 2021. Category: Standard SQL, Uncategorized.