Month: September 2021
The SQL substring function
I’ll cover SUBSTRING’s variants, its behaviour in odd cases, and what can go wrong.
I’ll add a C program that emulates what the SQL:2016 standard requires.
Ordinary case
SUBSTRING(value-expression FROM start-position [FOR string-length])
I expect that everyone knows the ordinary case: value-expression should be either a character string or a binary string, start-position and string-length should be integers, and SUBSTRING(‘abc’ FROM 2 FOR 2) will return ‘bc’.
Variation: SUBSTR and/or commas
SUBSTR(value-expression, start-position [, string-length])
Abbreviating SUBSTRING to SUBSTR, with commas instead of words to separate arguments, is Oracle and SQLite syntax. MySQL cheerily accepts both syntaxes and treats them as synonyms.
Informix accepts both but does not treat them as synonyms. DB2 has something similar.
Tarantool has SUBSTR() but will soon change to SUBSTRING().
So when you have a choice, you just have to decide: do you want to be compatible with Oracle, or with almost everybody else?
Variation: passing a non-string and non-integers
In MySQL/MariaDB I can say SUBSTRING(123, 1.5, 1.1) and get ‘2’. The standard requirement is: the first argument must be a string and the others must be exact numeric with scale 0. So rather than accepting decimals and rounding, a strictly standard DBMS will call this an error.
What we’re seeing here is the typical MySQL/MariaDB idea that, if it’s possible to do something with the input, do it. And in this case it’s possible by converting the first argument to a string and rounding the next arguments.
But it’s not strictly a MySQL thing: Oracle and DB2 will also accept non-integers and do implicit conversions.
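As a rough illustration of the "if it's possible, do it" idea, here is a minimal C sketch of what the conversions amount to: the number 123 becomes the string '123', and 1.5 and 1.1 are rounded to 2 and 1. The function names are hypothetical and this is not any DBMS's actual code; rounding halves away from zero is an assumption chosen to match the example above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: round to the nearest integer, halves away from zero.
   Assumption: this is the rounding behind the SUBSTRING(123, 1.5, 1.1) example. */
static long round_half_away(double x)
{
    return (x >= 0.0) ? (long)(x + 0.5) : (long)(x - 0.5);
}

/* Hypothetical helper: convert the number to text, round the other two
   arguments, then take the substring. Only positive positions are handled. */
void lenient_substring(long value, double start, double length,
                       char *out, size_t out_size)
{
    char text[32];
    long s, n, len;

    if (out_size == 0) return;
    out[0] = '\0';
    snprintf(text, sizeof text, "%ld", value);   /* 123 -> "123" */
    s = round_half_away(start);                   /* 1.5 -> 2 */
    n = round_half_away(length);                  /* 1.1 -> 1 */
    len = (long)strlen(text);
    if (s < 1 || s > len || n < 1) return;        /* sketch: return empty */
    if (n > len - s + 1) n = len - s + 1;         /* clip at end of string */
    if ((size_t)n >= out_size) n = (long)out_size - 1;
    memcpy(out, text + (s - 1), (size_t)n);
    out[n] = '\0';
}

/* Call this from main() to check the example from the text. */
void lenient_demo(void)
{
    char buf[8];
    lenient_substring(123, 1.5, 1.1, buf, sizeof buf);
    assert(strcmp(buf, "2") == 0);
    printf("SUBSTRING(123, 1.5, 1.1) -> '%s'\n", buf);
}
```

A DBMS that follows the standard would skip all of this and reject the call because of the argument types.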
Variation: negative start-position
This is probably the most frequent variation.
What is SUBSTRING(‘abc’, -2, 2)?
The standard is 100% clear (well, as clear as it ever gets): start-position is 2 characters before the start of the string, and string start is 1, so
SUBSTRING(‘abc’ FROM -2 FOR 2) is a zero-length string but
SUBSTRING(‘abc’ FROM -2 FOR 4) is ‘a’.
SQL Server follows this though I don’t know whether it always does so.
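The arithmetic behind those two results can be worked through with the standard's own symbols: S is the start position, L the string length, LC the length of the string (3 for ‘abc’), and E the position just past the requested end. S1, E1 and L1 are the values after clipping to the real string.

```latex
\begin{align*}
\text{FROM } -2 \text{ FOR } 2:&\quad E = S + L = -2 + 2 = 0 < 1
  \;\Rightarrow\; \text{zero-length string} \\
\text{FROM } -2 \text{ FOR } 4:&\quad E = S + L = -2 + 4 = 2, \\
&\quad S_1 = \max(S, 1) = 1,\quad E_1 = \min(E, LC + 1) = \min(2, 4) = 2, \\
&\quad L_1 = E_1 - S_1 = 1
  \;\Rightarrow\; \text{one character starting at position 1, that is, `a'}
\end{align*}
```

In other words, FOR 4 asks for positions -2, -1, 0 and 1, and only position 1 exists.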
However, perhaps because programmers are used to seeing such stuff in PHP and Perl and Lua, the alternative is to say that negative start-positions are counted backwards from the end of the string, so -1 is the last character, -2 is the second-last, and so on.
So in MySQL SUBSTRING(‘abc’ FROM -2 FOR 2) is ‘bc’ and
SUBSTRING(‘abc’ FROM -2 FOR 4) is also ‘bc’.
But don’t hope for consistency: SUBSTRING(‘abc’ FROM -4 FOR 5) is just a zero-length string.
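The count-backwards behaviour in those three examples can be emulated with a few lines of C. This is a minimal sketch inferred only from the examples above, with a hypothetical function name; it is not taken from any DBMS's source code, it handles single-byte characters only, and returning an empty string for start position 0 is an assumption.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the count-backwards convention: a negative start is counted
   from the end of the string, so -1 is the last character. */
void backwards_substring(const char *s, int start, int length,
                         char *out, size_t out_size)
{
    int len = (int)strlen(s);
    int first;   /* zero-based index of the first character to copy */

    if (out_size == 0) return;
    out[0] = '\0';
    if (start == 0 || length < 1) return;      /* assumption: empty result */
    if (start > 0) first = start - 1;
    else first = len + start;                   /* -1 -> last character */
    if (first < 0 || first >= len) return;      /* e.g. FROM -4 on 'abc' */
    if (length > len - first) length = len - first;   /* clip at the end */
    if ((size_t)length >= out_size) length = (int)out_size - 1;
    memcpy(out, s + first, (size_t)length);
    out[length] = '\0';
}

/* Call this from main() to check the three examples from the text. */
void backwards_demo(void)
{
    char buf[16];

    backwards_substring("abc", -2, 2, buf, sizeof buf);
    assert(strcmp(buf, "bc") == 0);
    backwards_substring("abc", -2, 4, buf, sizeof buf);
    assert(strcmp(buf, "bc") == 0);
    backwards_substring("abc", -4, 5, buf, sizeof buf);
    assert(strcmp(buf, "") == 0);
    printf("count-backwards examples match\n");
}
```

The inconsistency shows up in the early return for a start that falls before the first character: nothing is clipped, the whole request is simply dropped.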
I’ll admit that the count-backwards variation is popular, since it’s in Oracle and SQLite and MySQL/MariaDB (and in Informix SUBSTR, which is the difference I mentioned earlier). But it’s awfully easy to avoid: use a different function if your DBMS supports RIGHT() or INSTR(), or say
SUBSTRING(‘abc’ FROM LENGTH(‘abc’) - 1 FOR 2).
For that last suggestion, I wondered: won’t that take longer? So I filled a table with long strings.
CREATE TABLE j (s1 TEXT(1000000));
CREATE PROCEDURE p()
BEGIN
  DECLARE i INTEGER DEFAULT 0;
  WHILE i < 737 DO
    INSERT INTO j VALUES (REPEAT('abc',1000000));
    SET i = i + 1;
  END WHILE;
END;
CALL p();
SELECT SUBSTRING(s1 FROM -2 FOR 2) FROM j;
SELECT SUBSTRING(s1 FROM LENGTH(s1)-2 FOR 2) FROM j;
The SELECT with “FROM -2” takes on average about 12 seconds on my laptop.
The SELECT with “FROM LENGTH(s1)-2” takes on average about 7 seconds.
Your mileage will vary, but if you get anything vaguely similar you’ll have to conclude that non-standard negative start positions are unnecessary.
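To make the position arithmetic behind the LENGTH() rewrite concrete: under the count-backwards convention, start position -n refers to the character at standard position LENGTH - n + 1. A minimal C sketch, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: translate a count-backwards start position
   (-1 = last character) into a standard 1-based start position. */
int standard_position(int length, int backwards_start)
{
    return length + backwards_start + 1;
}

/* Call this from main() to check the arithmetic for 'abc'. */
void position_demo(void)
{
    assert(standard_position(3, -1) == 3);   /* last character of 'abc' */
    assert(standard_position(3, -2) == 2);   /* second-last character */
    printf("FROM -2 on a 3-character string is FROM %d\n",
           standard_position(3, -2));
}
```

So on a three-character string the standard-syntax equivalent of FROM -2 is FROM 2, which is LENGTH - 1.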
Variation: going past the end
Of course SUBSTRING(‘abc’ FROM 3 FOR 5) is ‘c’ and the nonexistent characters past the end don’t matter. That’s standard, and it’s why I found it inconsistent when I saw different behaviour for nonexistent characters before the start.
But there’s another possible answer: the DB2 incomprehensible result. The documentation for SUBSTR() says that the result is an error, and simultaneously that the string is padded on the right with spaces or zero bytes.
Variation: CHAR in, VARCHAR out
If the input is CHAR, is the result supposed to be CHAR?
The standard is once again clear about this:
“If the declared type of <character value expression> is fixed-length character string or variable-length character string, then DTCSF is a variable-length character string type with maximum length equal to the fixed length or maximum length of DTCVE.”
For Oracle the result is VARCHAR if the input is CHAR.
For SQL Server the result is VARCHAR if the input is CHAR.
For DB2 the result is VARCHAR if the input is CHAR.
For MySQL the result is undocumented so I used my usual way of finding out what the data type of a result is …
CREATE TABLE j (s1 CHAR(5));
CREATE TABLE j2 AS SELECT SUBSTRING(s1 FROM 2 FOR 2) FROM j;
SELECT table_name, data_type FROM information_schema.columns WHERE column_name = 's1';
Result: ‘char’. Yes, I know, SHOW CREATE TABLE j2; would say ‘varchar’, but I don’t rely on it here. If you set sql_mode=’pad_char_to_full_length’; and then select length(column_name) from all rows in the table, you’ll see it’s always the defined length.
Variation: octets
SUBSTRING(… FROM … FOR … [USING CHARACTERS|OCTETS])
By now you’ve gathered that my main concern is SUBSTRING for characters more than SUBSTRING for binaries, but the standard says I can specify the position and length in bytes rather than characters.
So, if you’ve got a multi-byte character, and you use OCTETS, you can split it so that the result is not valid characters. For example: starting with ‘Д’ in UTF-8 (which is encoded as the two octets D0 94):
SUBSTRING(‘Д’ FROM 2 FOR 1 USING OCTETS)
is X’94’, which is not a valid UTF-8 character, let alone one that was in the input.
DB2 covers this up by replacing invalid character fragments with spaces.
For the standard, for the newer substring function added in SQL:2016, I read:
“If [character length units] is OCTETS and the [relevant] octet of STR is not the first octet of a character, then the result of [this function] is implementation-dependent.”
That certainly seems like the appropriate thing to say about all substring functions.
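Whether an octet is "the first octet of a character" is easy to test in UTF-8, because every continuation octet has the bit pattern 10xxxxxx. The following is a minimal C sketch of that check, applied to ‘Д’ (U+0414, octets D0 94); the function names are hypothetical and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* A UTF-8 continuation octet has the bit pattern 10xxxxxx. */
int is_continuation_octet(unsigned char octet)
{
    return (octet & 0xC0) == 0x80;
}

/* Returns 1 if the 1-based octet position is the first octet of a
   character, 0 if it is not or if the position is outside the string. */
int starts_character(const char *s, size_t octet_position)
{
    size_t n = strlen(s);

    if (octet_position < 1 || octet_position > n) return 0;
    return !is_continuation_octet((unsigned char)s[octet_position - 1]);
}

/* Call this from main() to check the example from the text. */
void octet_demo(void)
{
    const char *de = "\xD0\x94";   /* U+0414 CYRILLIC CAPITAL LETTER DE */

    assert(strlen(de) == 2);
    assert(starts_character(de, 1) == 1);
    assert(starts_character(de, 2) == 0);   /* X'94' is a continuation octet */
    printf("octet 2 is X'%02X'\n", (unsigned char)de[1]);
}
```

A SUBSTRING ... USING OCTETS that starts where starts_character() returns 0 is exactly the case the quoted rule calls implementation-dependent.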
Variation: added in SQL:2016
These SUBSTRING syntaxes which are part of SQL:2016 might someday be interesting:
<regular expression substring function> ::=
  SUBSTRING <left paren> <character value expression>
  SIMILAR <character value expression>
  ESCAPE <escape character> <right paren>

<regex substring function> ::=
  SUBSTRING_REGEX <left paren> <XQuery pattern> [ FLAG <XQuery option flag> ]
  IN <regex subject string> [ FROM <start position> ]
  [ USING <char length units> ]
  [ OCCURRENCE <regex occurrence> ]
  [ GROUP <regex capture group> ] <right paren>
… but as of today they seem not to have caught fire in the popular imagination.
Variation: zero-length strings can be errors
The obsolete manual for Oracle 10g says “When you specify a value that is less than 1, the function returns NA.” which I guess means it was an error long ago. Now the return is null because Oracle doesn’t distinguish it from a zero-length string.
In a draft of an earlier SQL standard, one might read:
“16) If the result of <string value expression> is a zero-length character string, then it is implementation-defined whether an exception condition is raised: data exception — zero-length character string”
But that, in the section about <string value function>, didn’t fit well. It was replaced by
“16) If the result of <string value function> is the zero-length character string or the zero-length binary string, then it is implementation-defined whether an exception condition is raised: data exception – zero-length character string or data exception – zero-length binary string, respectively.”
So: if the result of SUBSTRING() is ”, then it’s not a violation of the standard to return an error. However, I don’t expect that anybody would say that unless they did the same for everything, not just SUBSTRING().
C program that does what the standard says
I said the standard document is clear, but it is also intimidating. The main clause about SUBSTRING for characters looks like this:
3) If <character substring function> is specified, then:
  a) If the character encoding form of <character value expression> is UTF8, UTF16, or UTF32, then, in the remainder of this General Rule, the term "character" shall be taken to mean "unit specified by <char length units>".
  b) Let C be the value of the <character value expression>, let LC be the length in characters of C, and let S be the value of the <start position>.
  c) If <string length> is specified, then let L be the value of <string length> and let E be S+L. Otherwise, let E be the larger of LC + 1 and S.
  d) If at least one of C, S, and L is the null value, then the result of the <character substring function> is the null value.
  e) If E is less than S, then an exception condition is raised: data exception - substring error.
  f) Case:
    i) If S is greater than LC or if E is less than 1 (one), then the result of the <character substring function> is the zero-length character string.
    ii) Otherwise,
      1) Let S1 be the larger of S and 1 (one). Let E1 be the smaller of E and LC+1. Let L1 be E1-S1.
      2) The result of the <character substring function> is a character string containing the L1 characters of C starting at character number S1 in the same order that the characters appear in C.
It’s not something that humans can grasp immediately. But notice that, in English, it’s declaring variables and conditional-execution statements. So for testing purposes it’s straightforward to convert the English into C. That’s what I’ve done for this section. Here is a stand-alone program that replicates all the important parts of the clause so that anyone can see what SUBSTRING() would produce for any combination of
SUBSTRING(value-expression FROM start-position [FOR string-length])
/*
  Simulate ISO 9075-2:2016 description of substring
  By Peter Gulutzan 2021-08-30
  Copyright (c) 2021 by Peter Gulutzan. All rights reserved.

  To compile: gcc -o substring substring.c
  To run, with 2 or 3 args:
    substring character_value_expression start_position [string_length]
  Assumption: start_position and string_length are integers,
              there is no validity check.
  Convention: if character_value_expression is - then we treat it
              as an empty string.
  Example:
    pgulutzan@pgulutzan-VirtualBox:~/tarantool_sandbox$ ./substring 'abc' -1 1
    character_value_expression: abc.
    start_position: -1
    string_length: 1
    LC: 3
    S: -1
    L: 1
    E: 0
    S > LC or E < 1. Return zero-length string
*/

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>

int main( int argc, char *argv[] )
{
  if ((argc < 3) || (argc > 4))
  {
    printf("We want: character_value_expression start_position string_length\n");
    exit(0);
  }
  bool is_length_specified;
  char *character_value_expression= argv[1];
  int start_position= atoi(argv[2]); /* no validity check */
  int string_length;
  if (argc == 3)
  {
    is_length_specified= false;
  }
  else
  {
    is_length_specified= true;
    string_length= atoi(argv[3]); /* no validity check */
  }
  printf("character_value_expression: %s.\n", character_value_expression);
  printf("start_position: %d\n", start_position);
  if (is_length_specified == false)
  {
    printf("string_length: not specified\n");
  }
  else
  {
    printf("string_length: %d\n", string_length);
  }

  char *C= character_value_expression;
  int LC= -99, S= -99, E= -99, L= -99, S1= -99, E1= -99, L1= -99;

  if (strcmp(character_value_expression, "-") == 0) LC= 0;
  else LC= strlen(C);
  printf("LC: %d\n", LC);
  S= start_position;

  /*
    c) If <string length> is specified, then let L be the value of
       <string length> and let E be S+L.
       Otherwise, let E be the larger of LC + 1 and S.
  */
  if (is_length_specified == true)
  {
    L = string_length;
    E= S + L;
  }
  else
  {
    if ((LC + 1) > S) E= LC + 1;
    else E= S;
  }
  printf("S: %d\n", S);
  printf("L: %d\n", L);
  printf("E: %d\n", E);

  /*
    e) If E is less than S, then an exception condition is raised:
       data exception -- substring error.
  */
  if (E < S)
  {
    printf("E < S. data exception -- substring error.\n");
    exit(0);
  }

  /*
    f) i) If S is greater than LC or if E is less than 1 (one),
          then the result of the <character substring function>
          is the zero-length character string.
  */
  if ((S > LC) || (E < 1))
  {
    printf("S > LC or E < 1. Return zero-length string\n");
    exit(0);
  }

  /*
    f) ii) 1) Let S1 be the larger of S and 1 (one).
              Let E1 be the smaller of E and LC+1.
              Let L1 be E1-S1.
  */
  if (S > 1) S1= S;
  else S1= 1;
  if (E < (LC + 1)) E1= E;
  else E1= LC + 1;
  L1= E1 - S1;
  printf("S1: %d\n", S1);
  printf("E1: %d\n", E1);
  printf("L1: %d\n", L1);

  /*
    f) ii) 2) The result of the <character substring function> is
              a character string containing the L1 characters of C
              starting at character number S1 in the same order
              that the characters appear in C.
  */
  for (int i= S1 - 1; i < (S1 - 1) + L1; ++i)
  {
    printf("%c\n", *(C + i));
  }
}
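For regression testing, the same rules can also be packaged as a function instead of a command-line program. The following is a minimal sketch of that idea, not part of the original program: the function name, the return codes and the has_length flag are hypothetical, the null-value rule d) is left out, and characters are assumed to be single bytes. The assertions are the examples used earlier in this article.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical return codes for this sketch. */
enum { SUBSTRING_OK = 0, SUBSTRING_ERROR = 1, SUBSTRING_NO_ROOM = 2 };

/* Function form of rules c), e) and f). Pass has_length = 0 when the
   FOR clause is omitted; l is then ignored. */
int standard_substring(const char *c, int s, int has_length, int l,
                       char *out, size_t out_size)
{
    int lc = (int)strlen(c);
    int e, s1, e1, l1;

    if (out_size == 0) return SUBSTRING_NO_ROOM;
    out[0] = '\0';
    if (has_length) e = s + l;                    /* rule c) */
    else e = (lc + 1 > s) ? lc + 1 : s;
    if (e < s) return SUBSTRING_ERROR;            /* rule e) */
    if (s > lc || e < 1) return SUBSTRING_OK;     /* rule f) i): zero-length */
    s1 = (s > 1) ? s : 1;                         /* rule f) ii) 1) */
    e1 = (e < lc + 1) ? e : lc + 1;
    l1 = e1 - s1;
    if ((size_t)l1 >= out_size) return SUBSTRING_NO_ROOM;
    memcpy(out, c + (s1 - 1), (size_t)l1);        /* rule f) ii) 2) */
    out[l1] = '\0';
    return SUBSTRING_OK;
}

/* Call this from main() to check the examples from the article. */
void standard_demo(void)
{
    char buf[16];

    assert(standard_substring("abc", 2, 1, 2, buf, sizeof buf) == SUBSTRING_OK);
    assert(strcmp(buf, "bc") == 0);
    assert(standard_substring("abc", -2, 1, 2, buf, sizeof buf) == SUBSTRING_OK);
    assert(strcmp(buf, "") == 0);
    assert(standard_substring("abc", -2, 1, 4, buf, sizeof buf) == SUBSTRING_OK);
    assert(strcmp(buf, "a") == 0);
    assert(standard_substring("abc", 3, 1, 5, buf, sizeof buf) == SUBSTRING_OK);
    assert(strcmp(buf, "c") == 0);
    assert(standard_substring("abc", 2, 1, -1, buf, sizeof buf) == SUBSTRING_ERROR);
    printf("standard examples match\n");
}
```

A negative string length is the only input here that reaches the substring error, because it is the only way for E to be less than S.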
ocelotgui progress
The open-source GUI client program is getting constantly better. The next big feature will be “Export” to a variety of formats including delimited, boxed, html, and default. The preliminary code has been uploaded to our GitHub repository.
The README for the current version (1.5) is here for MySQL/MariaDB.
Or, here for Tarantool.