Representing Sex In Databases
The MySQL Reference Manual advises:
Use a CREATE TABLE statement to specify the layout of your table:
mysql> CREATE TABLE pet (name VARCHAR(20), owner VARCHAR(20),
-> species VARCHAR(20), sex CHAR(1), birth DATE, death DATE);
…
Several types of values can be chosen to represent sex in animal records, such as ‘m’ and ‘f’, or perhaps ‘male’ and ‘female’. It is simplest to use the single characters ‘m’ and ‘f’.
I will supplement that with this post about representing sex (gender) in people records. I have to consider the name, the possible values, the data type, and the constraints. It won’t be simplest.
Should the column name be ‘sex’?
First consider Naming Conventions. The word ‘sex’ is not too long, its characters are all Latin, there is no need to “delimit”, using lower case is okay, making it plural (‘sexes’) would be a mistake, and it’s not a reserved word. So far so good.
Next consider the alternative term ‘gender’. Some people make a distinction between gender and sex. In that case, ‘sex’ is better if only a physical characteristic matters, but ‘gender’ is better if you’re also taking into account the behaviour, self-identification, and washroom preference. However, gender’s primary dictionary meaning is as a grammatical category, and it is a far less popular word.
Conclusion: yes, the column name should be ‘sex’.
Should the possible values be ‘m’ and ‘f’?
Now consider interoperability. If there are application programs that use the database, what checkboxes are they likely to display? If the data must go to some other body (the government, say), do they tell you what the choices are? This is where standards come in. For pretty well anything that’s commonly used to describe a person (age, occupation, income, and of course sex) there will be a standard code list.
The German Wikipedia article about “Data standards for description of sex” describes many of them. It’s worth a look even if your German is weak, because most of the charts include English translations. I’ll try to use primary sources and then generalize:
- ISO 5218.
ISO/IEC 5218 Information technology — Codes for the representation of human sexes has single-digit numeric codes: 0 = not known, 1 = male, 2 = female, 9 = not applicable.
- Three-Way.
The CDC (Centers for Disease Control) prescription is ‘M’ = male, ‘F’ = female, ‘O’ = other.
On German birth certificates, starting in November the choices will be ‘M’, ‘F’, ‘ ‘.
For Australian passports the choices are ‘M’, ‘F’, ‘X’.
- European.
The Eurostat statistics group of the European Commission, and SDMX (Statistical Data and Metadata eXchange) which is sponsored by Eurostat and OECD, have produced a hodgepodge of incompatible recommendations. But the most recent list seems to be: T = Total, F = Females, M = Males, NAP = Not applicable, UNK = Unknown.
- Biological.
HL7 (Health Level 7), a standard for interoperability of health information adds a hermaphrodite category. CID 7455, a directive for physicians goes furthest of all with: M = Male, F = Female, U = Unknown Sex, MP = Male pseudohermaphrodite (see the Wikipedia article on Pseudohermaphroditism), FP = Female pseudohermaphrodite, H = Hermaphrodite, MC = Male changed to female, FC = Female changed to male, 121104 = Ambiguous, 121032 = Subject sex (for clinical purposes), 121102 = Other, 121103 = Temporarily undetermined (see the BBC article on determining sex or see “Everything you ever wanted to know about determining sex”).
ISO 5218 is the international standard, and gets a recommendation in a chapter about sex in “Joe Celko’s Data, Measurement and Standards in SQL”. The problem is it only uses digits. This eliminates the worry that it will look biased towards West European languages, but then it looks biased towards males (because ‘1’ comes before ‘2’), and on that basis has been called “offensive”.
More seriously, a digit has no mnemonic value and that means inputters and readers would have to learn more, or developers would have to write more.
Let’s look instead for something that the other “standards” can agree on.
They all say ‘M’ = male and ‘F’ = female, so that’s indisputable.
Most of them say there’s at least one other category, and although it is disputable it is prudent to add ‘O’ = other, forestalling complaints like the lack of ‘fluidity’ objection from the Gay, Lesbian, Bisexual, and Transgender Round Table of the American Library Association.
Most of them say there are codes for “unknown” and “not applicable”, and this should be disputed. Such codes are necessary for simple lists, but for SQL databases they are unnecessary because NULLs exist. If we said that the ‘sex’ column has special values for “unknown” or “not applicable”, but ‘owner’ and ‘species’ do not, we would need inconsistent rules. Therefore “unknown” and “not applicable” are to be handled with NULL.
They all say that ‘M’ and ‘F’ are upper case. Of course it might not matter in MySQL or MariaDB, where the default collation is case insensitive. But other DBMSs are stricter, and we’re thinking about interoperability here, so lower case codes like ‘m’ and ‘f’ are not acceptable equivalents.
Conclusion: allow ‘M’, ‘F’, ‘O’, and NULL.
Should the data type be CHAR(1)?
Now consider what best fits for the range of values, without totally forgetting efficient storage.
- ENUM(‘M’,’F’,’O’)
This has the advantage that input errors, like a sex value of ‘!’, won’t be allowed — the closest that MySQL can come to a CHECK clause. And it’s handy that ENUMs can be searched as numerics, which for this definition would be 1 = M and 2 = F, the same as the ISO/IEC requirement. But ENUM is not standard.
- SMALLINT or TINYINT
Well, any numeric type would be okay for storing an ISO-5218-style digit. However, saying it’s SMALLINT is telling the world “it’s a number”. Just because a digit can appear in a numeric column doesn’t mean that any of the things that are meaningful for numbers (averaging, multiplying, etc.) should be applicable to ‘sex’.
- BOOLEAN
MySQL’s BOOLEAN is just a synonym for TINYINT(1) so is of no use. If we were talking about the real idea of a Boolean — a binary TRUE/FALSE — we’d better change the column name to reflect that the answers are true or false — instead of ‘sex’, the column name should be ‘is_male’.
- VARCHAR(5)
This is a bit more flexible than CHAR(1) if one fears that someday there will be a mandate to store as CID 7455 which has 5-character codes, but VARCHAR might need more space.
- BIT(1) and CHAR(0)
Speaking of needing space, a bunch of BIT columns might be more economical for storage than a bunch of CHAR(1) columns, with MyISAM. And for the CHAR(0) trick, see a 2008 post from Percona. BIT(1) and CHAR(0) would only be for people desperate to save bytes; in other respects they are the worst choice.
Conclusion: yes, the data type should be CHAR(1).
Should there be constraints to prevent bad input?
Since we won’t use ENUM, and we can’t use a CHECK clause, how can we stop people from putting in something other than ‘M’, ‘F’, ‘O’, or NULL?
With old versions we’d have to make views with CHECK OPTION clauses, or perhaps foreign-key references. Nowadays we can make triggers. For example:
DELIMITER // CREATE TRIGGER check_sex AFTER INSERT ON pet FOR EACH ROW BEGIN IF new.sex COLLATE latin1_bin NOT IN ('M','F','O') AND new.sex IS NOT NULL THEN SIGNAL SQLSTATE '22222' SET MESSAGE_TEXT='sex must be ''M'' or ''F'' or ''O'' or NULL'; END IF; END//
For historical reasons that’s not a MySQLish practice, though. Ordinarily I’d expect that people will hope that application code will keep the input clean, and will do periodic cleanups when they find that application code does not keep the input clean.
The Real Point
The MySQL manual is correct about the way to design a “My Pets” database. But if a database description will consistently map to items in the real big wide world, there are going to be multiple things to consider.
Don’t use ENUM because it is “not standard?” But the whole rest of this blog is about MySQL! And you go so far as to use a trigger for validation?
I don’t understand what being “non standard” breaks in this case. If you export as SQL, perhaps, but if you export as CSV, the ENUM will come out as “M”, “F”, or “O”, which will be slurped right up by any other database.
I love ENUMS and use them whenever the set of values is fixed and unchanging — and even some cases where they aren’t, MySQL does the right thing if you add new values.