Month: October 2024
Work around a MariaDB behaviour change
CREATE PROCEDURE p() BEGIN DECLARE v CHAR CHARACTER SET utf8 DEFAULT ''; SELECT * FROM information_schema.routines WHERE routine_name > v; END; CALL p();
Result: with current MariaDB 11.4 or MySQL 8.3: a result set.
Result: with new or future MariaDB: an error message mentioning collations.
The error appears with MariaDB 11.5.1 and later, except for Debian-based distros such as Ubuntu where it appears with MariaDB 11.5.2 and later. It is not a bug, it is a behaviour change.
I’ll describe the error’s cause, the rare situations where the change breaks existing applications, the ways to check, and a workaround.
Assume routine’s character set is Unicode
Let’s say I installed a DBMS and created a UTF8 database. Then I can look at the database’s defaults with
SELECT * FROM information_schema.schemata;
If the server is MySQL 8.3:
The database that I created has default_character_set_name=utf8mb4, default_collation_name = utf8mb4_0900_ai_ci. System databases have default_character_set_name=utf8mb3, default_collation_name = utf8mb3_general_ci.
If the server is MariaDB 11.4:
The database that I created has default_character_set_name=utf8mb3, default_collation_name = utf8mb3_general_ci. System databases have default_character_set_name=utf8mb3, default_collation_name = utf8mb3_general_ci.
If the server is MariaDB 11.6:
The database that I created has default_character_set_name=utf8mb3, default_collation_name = utf8mb3_uca1400_ai_ci. System databases have default_character_set_name=utf8mb3, default_collation_name = utf8mb3_general_ci.
(A system database is sys, performance_schema, information_schema, or mysql.)
The difference between utf8mb3 and utf8mb4 is usually not important. Maybe bigger is better, but I looked at an admittedly-old stack overflow question which established that the extra characters are mostly Gothic, emoji, new Chinese personal names, and some mathematical symbols. At the moment you can’t create routines which require 4-byte UTF8 anyway which means the effect for system tables is nothing.
The difference between uca_0900 and uca_1400 collations is usually not important. The former refers to Unicode Collation Algorithm Version 9 (2016) the latter refers to Collation Algorithm Version 14 (2021). UCA’s version number is “synchronized with the version of the Unicode Standard which specifies the repertoire covered”, which means the order of new emojis or rarely-used scripts will differ. For example the Unicode-14 release notes mention the bronze-age Cypro-Minoan script, so if you use that, you should care.
However, the difference between latin1 and utf8 is usually so important that — I hope — many people will explicitly declare CHARACTER SET UTF8 or some variant, when creating the database or creating the routine. As the manual says: “If CHARACTER SET and COLLATE are not present, the database character set and collation in effect at routine creation time are used.”
Therefore, “DECLARE v CHAR;” will often mean “DECLARE v CHAR DEFAULT CHARACTER SET utf8;”. I’ll assume that’s the case. That’s what causes the error, since with a latin1 default it won’t happen.
The cause
The error message is due to the enhancement request Change default Unicode collation to uca1400_ai_ci which has explanations from the requester and the MariaDB collation expert about some of the things to watch out for, other than this error message. As well, you can find a condemnation of the non-UCA collations such as utf8mb3_general_ci.
The general collation is not awful, for example it knows that “B with dot above” is equal to B.
However, it sees differences where a Unicode _ai_ci collation does not. For example,
SELECT 'お' COLLATE utf8mb4_general_ci = 'オ' COLLATE utf8mb4_general_ci;
is false.
SELECT 'お' COLLATE uca1400_ai_ci = 'オ' COLLATE utf8mb4_uca1400_ai_ci;
is true.
But this makes it dangerous to change a system table. If the server manufacturer simply declared that the default collation of information_schema is henceforth UCA, then suddenly the indexes are no good and primary keys contain duplicates for people who happen to have created routines or tables with similar names.
Everybody learned their lesson about that from a long-ago fiasco. As I commented about it in a heated bug report in 2011: “Why don’t we just change the rules for utf8_general_ci, instead of introducing new collations with new rules? Well, as a matter of fact, that happened for another rule affecting German Sharp S in our version 5.1. The results were catastrophic, because collation affects order of keys in an index, and when index keys are out of the expected order then searches fail (Bug#37046 etc.). The only solution is to rebuild the index and when we have customers with multiple billion rows that’s hugely inconvenient. This change was a stupid error, we have sworn not to repeat it.”
Check routine for upgrade
This isn’t the same as CHECK TABLE … FOR UPGRADE and I don’t think it can be automated, it depends on humans and their particular situations.
Probably most straightforward upgrades will not cause a change of the default so there will be no immediate effect — until you create a routine which has the same expression as what worked before, which won’t work now.
Watch out for comparisons of variables with any columns in system tables.
Watch out for DECLARE HANDLER statements which could hide the error.
Watch out for CREATE TABLE statements which use the variable as a source because you don’t want to change their collation inadvertently.
Watch out for parameters too, as the passer’s character set might differ from the catcher’s.
Workaround: explicit collation
Add a COLLATE clause.
This will indicate the approved collation, so now the comparison will not cause an error.
That is, change
DECLARE v CHAR; /* assuming UTF8 database or explicit UTF8 at some higher level */
to
DECLARE v CHAR CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci;
But not change
WHERE routine_name > v;
to
WHERE routine_name > v COLLATE utf8mb3_general_ci;
This is a minimal change. It insists on the general collation, rather than a Unicode collation, so it’s compatible with the comparand and is less likely to cause an error and is more likely to cause an indexed search.
The COLLATE in the DECLARE clause usually will not interfere with later statements that contain COLLATE. For example, here is a procedure where there might seem to be a conflict:
CREATE PROCEDURE ae() BEGIN DECLARE v CHAR CHARACTER SET utf8mb3 COLLATE utf8mb3_general_ci; SET v = 'ä'; SELECT coercibility(v); SELECT coercibility((SELECT max(routine_name COLLATE utf8mb3_german2_ci) FROM information_schema.routines)); SELECT * FROM information_schema.routines WHERE routine_name COLLATE utf8mb3_german2_ci = v; END; CALL ae();
The nifty coercibility function will say that v is 2 (implicit) while the routine_name is 0 (explicit),
therefore the WHERE’s COLLATE clause trumps the variable’s COLLATE clause,
therefore there is no error,
therefore ‘ä’ = ‘ae’ is true,
therefore there will be one row in the result set.
The workaround has two flaws.
1. For MariaDB 11.6 there’s a feature request Change default charset from latin1 to utf8mb4 So far I haven’t seen that this has happened. But when it does, CHARACTER SET utf8 COLLATE utf8mb3_general_ci could cause trouble because utf8 won’t mean utf8mb3.
Thus COLLATE utf8mb3_general_ci may not be good for the long term, I actually prefer CAST(operand AS CHAR(n) CHARACTER SET utf8mb4) COLLATE utf8mb4_general_ci as something that can work today and can probably work tomorrow.
2. For MySQL 8.3 (which is obsolete) and a similar procedure that looked at tables, I saw the bizarre error message “Error 1267 (HY000) Illegal mix of collations (utf8mb3_tolower_ci,EXPLICIT) and (utf8mb4_0900_ai_ci,EXPLICIT) for operation ‘='” which must have been a bug since there’s no such thing as utf8mb3_tolower_ci, and it doesn’t happen for MySQL 8.4.
3. It’s extra trouble for something that you don’t need in legacy MariaDB versions. To which you might be saying, “Duh, Peter, just use MariaDB’s Executable Comment Syntax to suppress the unneeded stuff when the server isn’t the latest. Well, yes, but I don’t use non-standard syntax if I don’t have to.
Thanks to Alexander Barkov for a suggestion.
How this affected me
The ocelotgui (GUI for MySQL and MariaDB) contains a debugger for stored procedures and functions. It contains DECLARE … CHAR CHARACTER SET utf8; many times. This behaviour change broke it. This commit on GitHub shows my own workaround for it. It did not go through testing with the 11.6 preview. Yesterday I included it in the latest release.
But should the GUI warn about this when parsing CREATE PROCEDURE statements? If anyone thinks so, they can submit a bug report / feature request.