December 2013 – SQL And Its Sequels

The USA’s healthcare.gov site and LAMP

The USA’s health care exchange site, healthcare.gov, has had well-publicized initial woes.

The New York Times has said one of the problems was the government’s choice of DBMS, namely MarkLogic. A MarkLogic employee has said that “If the exact same processes and analysis were applied to a LAMP stack or an Oracle Exa-stack, the results would have likely been the same.”

I don’t know why he picked Exastack for comparison, but I too have wondered whether things would have been different if the American government had chosen a LAMP component (MySQL or MariaDB) as a DBMS, instead of MarkLogic.

What is MarkLogic?

The company is a software firm, founded in 2001 and based in San Carlos, California. It has 250 employees. The Gartner Magic Quadrant classes it as a “niche player” in the Operational DBMS category.

The product is a closed-source XML DBMS. The minimum price for a perpetual enterprise license is $32,000 but presumably one would also pay for support, just as one does with MySQL or MariaDB.

There are 250 customers. According to the Wall Street Journal “most of its sales come from dislodging Oracle Corp.”

One of the customers, since 2012 or before, is CMS (the Centers for Medicare & Medicaid Services), which is a branch of the United States Department of Health and Human Services. CMS is the agency that built the healthcare.gov online portal.

Is MarkLogic responsible for the woes?

Probably MarkLogic is not the bottleneck.

It’s not even the only DBMS that the application queries. There is certainly some contact with other repositories during a get-acquainted process, including Oracle Enterprise Identity management — so one could just as easily blame Oracle.

There are multiple other vendors. USA Today mentions Equifax, Serco, Optum/QSSI, and the main contractor, CGI Federal.

A particular focus for critics has been a web-hosting provider, Verizon Terremark. They have been blamed for some of the difficulties and will eventually be replaced by an HP solution. HP also has a fairly new contract for handling the replication.

Doubtless all the parties would like to blame the other parties, but “the Obama administration has requested that all government officials and contractors involved keep their work confidential”.

It’s clear, though, that the site was launched with insufficient hardware. Originally it was sharing machines with other government services. That’s changed. Now it has dedicated machines.

But the site cost $630 million so one has to suppose they had money to buy hardware in the first place. That suggests that something must have gone awry with the planning, and so it’s credible what a Forbes article is saying, that the government broke every rule of project management.

So we can’t be sure because of the government confidentiality requirement, but it seems unlikely that MarkLogic will get the blame when the dust settles.

Is MarkLogic actually fast?

One way to show that MarkLogic isn’t responsible for slowness would be to look for independent confirmation of its speed. The problem with that is MarkLogic’s evaluator-license agreement, from which I quote:

…
MarkLogic grants to You a limited, non-transferable, non-exclusive, internal use license in the United States of America
…
[You must not] disclose, without MarkLogic’s prior written consent, performance or capacity statistics or the results of any benchmark test performed on Software
…
[You must not] use the Product for production activity,
…
You acknowledge that the Software may electronically transmit to MarkLogic summary data relating to use of the Software

— http://developer.marklogic.com/products

These conditions aren’t unheard of in the EULA world, but they do have the effect that I can’t look at the product at all (I’m not in the United States), and others can look at the product but can’t say what they find wrong with it.

So it doesn’t really matter that Facebook got 13 million transactions/second in 2011, or that the HandlerSocket extension for MySQL got 750,000 transactions/second with a lot less hardware. Possibly MarkLogic could do better. And I think we can dismiss the newspaper account that MarkLogic “continued to perform below expectations, according to one person who works in the command center.” Anonymous accounts don’t count.

So we can’t be sure because of the MarkLogic confidentiality requirements, but it seems possible that MarkLogic could outperform its SQL competitors.

Is MarkLogic responsible for absence of High Availability?

High Availability shouldn’t be an issue.

At first glance the reported uptime of the site — 43% initially, 90% now — looks bad. After all, Yves Trudeau surveyed MySQL High Availability solutions years ago and found even the laggards were doing 98%. Later the OpenQuery folks reported that some customers find “five nines” (99.999%) is too fussily precise so let’s just round it to a hundred.
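To put those percentages in everyday terms, here is a small arithmetic sketch. It assumes a 24-hour day and a 365-day year, and the function names are mine, not anything from the reports. At 43% uptime the site is unavailable for roughly 13.7 hours a day, at 90% for 2.4 hours a day, and at five nines for a little over five minutes a year.

```python
HOURS_PER_DAY = 24
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_hours_per_day(availability_pct):
    """Average hours per day that a service is down at this availability."""
    return (100 - availability_pct) / 100 * HOURS_PER_DAY


def downtime_minutes_per_year(availability_pct):
    """Total minutes per year that a service is down at this availability."""
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


# The figures quoted in the text: initial, current, MySQL laggards, five nines.
for pct in (43, 90, 98, 99.999):
    print(
        f"{pct}% uptime: "
        f"{downtime_hours_per_day(pct):.2f} hours/day down, "
        f"{downtime_minutes_per_year(pct):.1f} minutes/year down"
    )
```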

At second glance, though, the reported uptime of the site is okay.

First: The product only has to work in 36 American states and Hawaii is not one of them. That’s only five time zones, then. So it can go down a few hours per night for scheduled maintenance. And uptime “exclusive of scheduled maintenance” is actually 95%.

Second: It’s okay to have debugging code and extra monitoring going on during the first few months. I’m not saying that’s what’s happening — indeed the fact that they didn’t do a 500-simulated-sites test until late September suggests they aren’t worry warts — but it is what others would have done, and therefore others would also be below 99% at this stage of the game.

So, without saying that 90 is the new 99, I think we can admit that it wouldn’t really be fair to make a big deal about some LAMP installation that has higher availability than healthcare.gov.

Is it hard to use?

MarkLogic is an XML DBMS. So its principal query language is XQuery, although there’s a section in the manual about how you could use SQL in a limited way.

Well, of course, to me and to most readers of this blog, XQuery is murky gibberish and SQL is kindergartenly obvious. But we have to suppose that there are XML experts who would find the opposite.

What, then, can we make out of the New York Times’s principal finding about the DBMS? It says:

“Another sore point was the Medicare agency’s decision to use database software, from a company called MarkLogic, that managed the data differently from systems by companies like IBM, Microsoft and Oracle. CGI officials argued that it would slow work because it was too unfamiliar. Government officials disagreed, and its configuration remains a serious problem.”

— New York Times November 23 2013

Well, of course, to me and to most readers of this blog, the CGI officials were right because it really is unfamiliar — they obviously had people with experience in IBM DB2, Microsoft SQL Server, or Oracle (either Oracle 12c or Oracle MySQL). But we have to suppose that there are XML experts who would find the opposite.

And, though I think it’s a bit extreme, we have to allow that it’s possible the problems were due to sabotage by Oracle DBAs.

Yet again, it’s impossible to prove that MarkLogic is at fault, because we’re all starting off with biases.

Did the problem have something to do with IDs?

I suspect there was an issue with IDs (identifications).

It starts off with this observation of a MarkLogic feature: “Instead of storing strings as sequences of characters, each string gets stored as a sequence of numeric token IDs. The original string can be reconstructed using the dictionary as a lookup table.”
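For readers who haven’t met the idea, here is a minimal sketch of what storing a string as a sequence of token IDs could look like. This is not MarkLogic’s implementation, which I can’t inspect. The class name, the split-on-single-spaces tokenizer, and the use of list positions as IDs are all assumptions made for illustration.

```python
class TokenDictionary:
    """Toy dictionary encoder: each distinct token gets one numeric ID."""

    def __init__(self):
        self._id_by_token = {}   # token -> ID
        self._token_by_id = []   # ID -> token (the lookup table)

    def encode(self, text):
        """Return the list of token IDs for text, assigning new IDs as needed."""
        ids = []
        for token in text.split(" "):  # assumption: tokens are space-separated
            if token not in self._id_by_token:
                self._id_by_token[token] = len(self._token_by_id)
                self._token_by_id.append(token)
            ids.append(self._id_by_token[token])
        return ids

    def decode(self, ids):
        """Reconstruct the original string from its token IDs."""
        return " ".join(self._token_by_id[i] for i in ids)


dictionary = TokenDictionary()
encoded = dictionary.encode("the cat saw the dog")
print(encoded)
print(dictionary.decode(encoded))
```

Notice that every new token requires generating a new identifier, so in a scheme like this the speed of ID generation matters whenever fresh data arrives.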

It ends with this observation from an email written on September 27 2013 by a healthcare.gov worker: “The generation of identifiers within MarkLogic was inefficient. This was fixed and verified as part of the 500 user test.”

Of course it’s nice to see that it was fixed, but isn’t it disturbing that a major structural piece was inefficient as late as September?

Hard to say. Too little detail. So the search for a smoking gun has so far led nowhere.

Is it less reliable?

Various stories — though none from the principals — suggest that MarkLogic was chosen because of its flexibility. Uh-oh.

The reported quality problems are “one in 10 enrollments through HealthCare.gov aren’t accurately being transmitted” and “duplicate files, lack of a file or a file with mistaken data, such as a child being listed as a spouse.”

I don’t see how the spousal problem could have been technical, but the duplications and the gone-missings point to: uh-oh, lack of strong rules about what can go in. And of course strong rules are something that the “relational” fuddy-duddies have worried about for decades. If the selling point of MarkLogic is in fact leading to a situation which is less than acceptable, then we have found a flaw at last. In fact it would suggest that the main complaints so far have been trivia.
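By “strong rules” I mean the sort of thing sketched below, using Python’s built-in sqlite3 module as a stand-in for any SQL DBMS. The table, its columns, and the sample values are hypothetical and have nothing to do with the real healthcare.gov schema. The point is only that a declared key and declared constraints make the DBMS itself refuse duplicates and incomplete records.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE enrollments (
        applicant_id INTEGER NOT NULL,
        plan_id      INTEGER NOT NULL,
        relationship TEXT    NOT NULL
            CHECK (relationship IN ('self', 'spouse', 'child')),
        PRIMARY KEY (applicant_id, plan_id)   -- no duplicate enrollments
    )
    """
)


def try_insert(row):
    """Insert one row; report whether the DBMS accepted or rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO enrollments VALUES (?, ?, ?)", row)
        return "accepted"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


attempts = [
    (1, 100, "self"),    # fine
    (1, 100, "self"),    # duplicate of the first
    (2, 100, None),      # relationship missing
    (3, 100, "cousin"),  # not an allowed relationship
]
results = [try_insert(row) for row in attempts]
for row, result in zip(attempts, results):
    print(row, "->", result)
```

No constraint can notice that a child was entered as a spouse, since both are legal values, which is why I doubt that particular problem was technical.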

This is the only matter that I think looks significant at this stage.

How’s that hopey-changey stuff working out for your Database?

The expectation of an Obama aide was: “a consumer experience unmatched by anything in government, but also in the private sector.”

The result is: so far not a failure, and nothing that shows that MarkLogic will be primarily responsible if it is a failure.

However: most of the defence is along the lines of “we can’t be sure”. That cuts both ways — nobody can say it’s “likely” that LAMP would have been just as bad.

pgulutzan, December 8, 2013. Category: MySQL / MariaDB, NoSQL.

Tuples

“It is better to keep silence and be thought a fool, than to say ‘tuple’ and remove all doubt.”

But recently people have been using the word “tuple” more frequently. Doubtless all those people know that in relational databases a tuple (or a tuple value) is a formal term for a row of a table. It’s possible to know a bit more than that.

Pronounced Tyoople, Toople, or Tuhple?

The Oxford Dictionaries site says Tyoople. Other dictionaries are neutral about the terms from which Tuple was derived (sextuple, octuple, etc.), for example Merriam-Webster says they usually end in Toople but Tuhple is an accepted alternate, and the Oxford Canadian Dictionary says it’s always Tuhple. So the question comes down to: what is the proper way in a database context?

I found one book that says Tuple rhymes with Scruple, that is, it’s Toople: Rod Stephens, Beginning Database Design Solutions. But Mr Stephens also tells us that tables are called relations because “the values of a row are related”, so I’m wary about him.

I found four books that say Tuple rhymes with Couple, that is, it’s Tuhple:

David Kroenke, Database Processing: Fundamentals, Design and Implementation
Kevin Loney, Oracle9i The Complete Reference
Donald Burleson, Oracle High-Performance SQL Tuning
Paul Nielsen, SQL Server 2005 Bible

Then I found the decisive one:
C.J.Date, An Introduction To Database Systems.
I quote:

The relational model therefore does not use the term “record” at all; instead it uses the term “tuple” (rhymes with “couple”).

Since C.J.Date had many conversations with The Founder (E.F.Codd), and Mr Codd would have been certain to correct Mr Date if he had mispronounced it, this is decisive. Most writers in the field, including the one who ought to know, are saying that it rhymes with couple.

Wait a minute — couple was originally a French word, and the French would use an oo sound, what about that? A good explanation, although it’s based on analogy, is that some Middle English words with oo (like blood and flood) changed in stages from the oo sound to the uh sound in the centuries following the Great English Vowel Shift. See the Wikipedia article about Phonological history of English high back vowels. So the reply to people who say “etymologically it was an oo sound” would be “yes, but oo changed to uh as part of a trend, get modern”.

But what if it’s a non-relational tuple?

Quoting C.J.Date again (from The Relational Database Dictionary):

NOTE: Tuples as defined in the relational model differ in certain respects from the mathematical construct of the same name. In particular, tuples in mathematics typically don’t have named attributes; instead, their attributes are identified by their ordinal position, left to right.

Aha. So if a sequence of values doesn’t have a corresponding header with a sequence of column names, it shouldn’t be called a row (that would be relational) but it could be called a tuple, provided it’s not in a relational database. In practice that seems to be a fairly common usage, but I’ll highlight the products where it seems to be the preferred usage.

Tuple Spaces.
The modest idea of just filling a space with tuples, and calling it a “tuple space”, started off in 1982 with a language named Linda. Since then the general concept has gotten into various Java implementations.
Python.
The tuple is a supported data type that’s part of the Python core. Of course there are similar things in other languages but I believe that Python is the most prominent language that actually calls it a tuple.
Pig.
FoundationDB.
Tarantool.
Actually one of my current projects is enhancing Tarantool’s documentation, which is what led me to wonder about the word.
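The first two items in that list can be illustrated together: a toy tuple space holding Python tuples. The operation names out, in and rd come from Linda (in is spelled in_ here because in is a Python keyword). Everything else is my own simplification: the class is single-threaded, it returns None where a real Linda operation would block until a match appears, and it uses None in a pattern as a wildcard.

```python
class TupleSpace:
    """Toy, non-blocking tuple space. Tuples are matched by position only."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        """Put a tuple into the space."""
        self._tuples.append(tuple(tup))

    @staticmethod
    def _matches(pattern, tup):
        # Same length, and every non-wildcard field equal, position by position.
        return len(pattern) == len(tup) and all(
            p is None or p == v for p, v in zip(pattern, tup)
        )

    def rd(self, pattern):
        """Return the first matching tuple without removing it, else None."""
        for tup in self._tuples:
            if self._matches(pattern, tup):
                return tup
        return None

    def in_(self, pattern):
        """Return the first matching tuple and remove it, else None."""
        tup = self.rd(pattern)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


space = TupleSpace()
space.out(("point", 1, 2))
print(space.rd(("point", None, None)))   # read, tuple stays in the space
print(space.in_(("point", None, None)))  # take, tuple leaves the space
print(space.rd(("point", None, None)))   # nothing left to match
```

The fields have no names at all; the only way to refer to one is by its ordinal position, which is exactly the non-relational sense of the word.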

The MySQL manual usually avoids the term, although it’s more frequent with NDB.

Alas

Recently I saw a poll with a single question:

What is to be Done? (a) Nothing (b) Something

I think that (a) won, hurrah. And yet it would have been a finer world if everyone had agreed that “tuple” meant a sequence of values, “record” meant a sequence of values which had fixed types, and “row” meant a sequence of values which had both fixed types and fixed names. If only Mr Codd had left the vocabulary alone …
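That wished-for vocabulary is easy to state in code. The sketch below is only my illustration of the three definitions; the helper names make_record and make_row and the sample schema are hypothetical.

```python
# "tuple": a sequence of values, nothing more.
plain_tuple = (42, "Codd")


def make_record(types, values):
    """'record': a sequence of values checked against fixed types."""
    if len(types) != len(values):
        raise TypeError("wrong number of values")
    for position, (expected, value) in enumerate(zip(types, values)):
        if not isinstance(value, expected):
            raise TypeError(f"value {position} is not {expected.__name__}")
    return tuple(values)


def make_row(schema, values):
    """'row': a sequence of values with both fixed types and fixed names."""
    names = [name for name, _ in schema]
    types = [type_ for _, type_ in schema]
    return dict(zip(names, make_record(types, values)))


record = make_record((int, str), plain_tuple)
row = make_row([("id", int), ("name", str)], plain_tuple)

print(plain_tuple[1])  # a tuple or record field is reached by position
print(row["name"])     # a row field is reached by name
```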

pgulutzan, December 2, 2013. Category: Standard SQL.