Month: December 2014


What is a data type?

I’d suppose that these statements are generally accepted:

A data type is a set of values.

So any value that belongs to that set is said to “have” or “belong to” that data type.

And any column which is defined as “of” that data type is going to contain values which have that data type, only.

For example a column C may be defined as of type INTEGER and will only contain values which belong to the set of integers.

And now for the exceptions, details, caveats, and errors.

What is synonymous with data type?

There are no synonyms; no other term should ever be used.

Not datatype. Oracle’s manual has “datatype” but IBM’s doesn’t and Microsoft’s doesn’t, and it’s not in the standard.

Not domain. C.J.Date (Date on Database, Writings 2000-2006) says

I observe that some people don’t understand even yet that domains and types are the same thing. The SQL standard muddies the water, too, because it used the term DOMAIN to mean something else …”

From your perspective the implication should be: you can’t use domain as a synonym because users of SQL products will think you mean something else.

Not class. Although some people uses class and type interchangeably, class can also mean a set of objects (rather than just a set of values), so it’s a less precise term.

What does the standard say

Move now to what’s in the SQL:2011 standard document. We find:

A data type is a set of representable values. Every representable value belongs to at least one data type and some belong to several data types.

This is restating the common definition — “a data type is a set of values” — without significant limitations. By saying “representable”, the standard seems to be merely accepting that a DBMS shouldn’t include what it can’t store or display. By saying “several data types”, it’s acknowledging that the value denoted by the literal 9999 could fit in either INTEGER or SMALLINT.

Does a data type include length, character set, etc.?

Suppose the definition of a column is

COLUMN_NAME CHARACTER(32)

Is the data type CHARACTER, or is the data type CHARACTER(32)?

This is a difficult question. The best answer I can give, which I’ll claim is what the SQL:2011 standard would say if it could, is: “CHARACTER(32)”.

Yes the name of the data type is CHARACTER. The standard only defines a small set of data type names — CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT, BINARY, BINARY VARYING, BINARY LARGE OBJECT, NUMERIC, DECIMAL, SMALLINT, INTEGER, BIGINT, FLOAT, REAL, DOUBLE PRECISION, BOOLEAN, DATE, TIME, TIMESTAMP, INTERVAL — and of course “CHARACTER” is on that list but “CHARACTER(32)” is not, it is not a name.

On the other hand — and I think the standard’s vocabulary choice is unfortunate — the data type’s name does not identify the data type! The data type is actually identified by a “data type descriptor” which is defined with the entire <data type> clause, and which, as well as name, includes precision, character set, etc.

That is not explicitly stated, so I’ll support the interpretation with several quotes which show that it’s implicit.

“A data type is predefined even though the user is required (or allowed) to provide certain parameters when specifying it (for example the precision of a number).”

“Two data types,T1 and T2, are said to be compatible if T1 is assignable to T2,T2 is assignable to T1, and their descriptors include the same data type name.”

“The data types “CHARACTER(n) CHARACTER SET CS1” and “CHARACTER(m) CHARACTER SET CS2”,where CS1 ≠ CS2, have descriptors that include the same data type name (CHARACTER), but are not mutually assignable”

“If a <data type> is specified, then let [the data type of a domain] be the data type identified by <data type>.”

“If <data type> is specified, then [the declared type of the column] is that data type.”

“If the maximum length of the subject data type is fixed [then default literals shall not be too long].”

“All data types in [destination strings] shall be character string, and all of them shall have the same character repertoire.”

“[If the result of an operation is] not exactly representable with the precision and scale of the result data type [then it’s an error].”

“A site declared with a character string type may be specified as having a collation, which is treated as part of its data type.”

… I quote them merely for the sake of showing that length and character set and repertoire are all part of data type, and that the <data type> clause is what specifies or identifies the data type, and that two different data types can both have the same data type name.

This does not mean that CHARACTER(32) is a subtype of CHARACTER, because the standard is clear that predefined data types cannot have subtypes or supertypes. (User-defined types can have subtypes or supertypes; however, for user-defined types questions like “is the data type character(32)?” don’t arise.)

This does not mean that saying “the data type is CHARACTER” is wrong, because, whenever we’re talking abstractly or generally, that’s appropriate.

It does mean that “data type is CHARACTER(32)” is much more correct when the definition is known.

What about funny-looking values?

If a data type is supposed to be a set of values, but NULLs are allowed, then what is the set?

Taking the line that NULL represents UNKNOWN, and using MySQL/MariaDB limits for SMALLINT, we could say that the set is {-32768, -32767, -32766, … 0, …, +32767} — NULL is somewhere in that enumeration, we just don’t happen to know where. But NULL could mean other things besides UNKNOWN, and sometimes it definitely does — think of ROLLUP. The standard just says “Every data type has a special value, called the null value, denoted by the keyword NULL.” The set then is {NULL, -32768, -32767, -32766, … 0, …, +32767}.

If the limits of the floating-point sets are supposed to be “based on the IEEE standard”, as the MySQL manual says they are, then where are NaN (not a number) and INF (pseudo-infinity)?

This has been the subject of feature requests like Bug#57519 Documentation for IEEE 754 compliance is missing from Daniël van Eeden, and Bug#41442 Support ieee754 by supporting INF and NaN from Mark Callaghan.

Bug#41442, incidentally, says “the SQL spec doesn’t allow [INF and NaN]”. I’d put it less severely — the standard effectively disallows them for all the standard predefined numeric data types including FLOAT or DOUBLE PRECISION (for example it says that an approximate-numeric literal must have the form <mantissa>E<exponent>). The standard does not forbid new implementation-defined data types.

Other specific effects on our favourite DBMSs

As for “what is a data type in MySQL (or by extension in MariaDB)?”, the
correct answer is: whatever the MySQL or MariaDB documentation says. There used to be arguments about this, but I think they’re all in the past now.

Equally, there’s no point arguing about how TINYBLOB etc. should have been specified in terms of maximum length or precision, or about how ENUM should have been a [VAR]CHAR plus a constraint. The more feasible feature request would be along the lines of: more complete support for the standard types.

General Purpose Storage Engines in MariaDB

MariaDB tries to encourage use of many storage engines, which Oracle doesn’t bother with. The ones that could be considered — occasionally! — as InnoDB replacements are: Mroonga, TokuDB, and Aria.

Mroonga

I predicted that Mroonga would be in MariaDB 10.0.8, and behold, here it is in MariaDB 10.0.15. Truly I am the Nostradamus of the database world.

It’s a general-purpose storage engine, in that it doesn’t have serious limitations regarding what can be stored or indexed. I think its “column-store” feature has limited appeal, but as I saw in January, it’s dandy for full-text searching. This time I look at the list of “Full-Text Restrictions” in the MySQL manual, and quote from it:

Full-text searches are supported for InnoDB and MyISAM tables only.” — Not true for Mroonga, of course.

Full-text searches are not supported for partitioned tables.” — True for Mroonga, which can’t handle partitioned tables at all.

… the utf8 character set can be used, but not the ucs2 [or utf16 or utf32] character set.” — True for Mroonga.

Therefore, the FULLTEXT parser cannot determine where words begin and end in [Chinese/Japanese/Korean i.e. CJK].” — Not true for Mroonga, which is just great with CJK.

… all columns in a FULLTEXT index must use the same character set and collation.” — True for Mroonga.

The argument to AGAINST() must be a string value that is constant during query evaluation.” — True for Mroonga.

For InnoDB, all DML operations (INSERT, UPDATE, DELETE) involving columns with full-text indexes are processed at transaction commit time.” — Technically not true for Mroonga, but Mroonga indexes will get out of synch if rollbacks occur.

TokuDB

In a previous escapade I found that TokuDB was good stuff but lacked two features: foreign keys and full-text search.

So, inspired by the fact that I can use Mroonga full-text searching with InnoDB, I wondered: wouldn’t it be great if I could use that trick on TokuDB too?

Well, I gave it a try, and have to say the result was a bit disappointing …

MariaDB [test]> CREATE TABLE t (
    ->   id INT NOT NULL DEFAULT '0',
    ->   `text_column` VARCHAR(100) CHARACTER SET utf8 DEFAULT NULL,
    ->   PRIMARY KEY (id),
    ->   FULLTEXT KEY `text_column` (text_column)
    -> ) ENGINE=mroonga DEFAULT CHARSET=latin1 COMMENT='engine "tokudb"'
    -> ;
ERROR 2013 (HY000): Lost connection to MySQL server during query

… Do not do this on a production system, as it will disable all your databases.

I also have a MariaDB 10.0.15 server that I built from source, and on that server TokuDB crashes even for simple everyday statements. I suppose this is like saying “Doctor my head hurts when I beat it against the wall”. But there’s no warning during the build and no warning in the documentation. Therefore it must be somebody else’s fault.

Aria

In 2013 Stewart Smith said about Aria: “It’s still not there and I don’t think it ever will be.” I didn’t see anything in 2014 that contradicts what he predicts, and the MariaDB people themselves say improvements are “on hold”, but it does have some advantages over MyISAM.

This is another storage engine with full-text support, but I rate that feature as “poor” because (like InnoDB) it won’t do CJK well, and (unlike InnoDB) it has some irritating default settings about minimum key length and maximum frequency.

It looks nice that I can create a SPATIAL index with Aria; however, I see that this is supposedly possible with InnoDB too, according to the recent announcement of MySQL 5.7.5.

Data can be represented in tables

Capability Mroonga TokuDB Aria InnoDB
Full-text Indexes Excellent No Poor Poor
Foreign Keys No No No Yes
Rollbacks No Yes No Yes
Spatial Indexes Yes No Yes Real Soon
Maximum Key Length 3072 3072 1000 767
Smallest Lock Column Row Table Row
Allows Partition Clauses No Yes Yes Yes
Recovers after ‘kill -9’ No Yes Yes Yes
Works on Windows Yes No Yes Yes

Or, if you want to go beyond general-purpose open-source storage engines like these, there are lots more to look at. The biggest list of MySQL-compatible storage engines that I know of is in the Spanish Wikipedia.

UPDATE NOTICE: The chart above has been corrected due to comments from a Mroonga developer and a TokuDB developer.