¶ Death to False (Metal) · 8 August 2007 essay/tech
Two of the very bad features of the SQL-based conception of relational data-modeling, I think, are these:
- one-to-many relationships are significantly harder to handle than one-to-one relationships (and, inversely, many-to-one relationships easiest to handle by defactoring)
- absence is significantly harder to handle than presence
It's easy to add an Artist column to an Album table, for example, so we get:
But if you want to model the fact that collaboration albums can have multiple artists, you need an Albums table, an Artists table, a join table, and a bunch of IDs:
This is already a pain in the ass, and thus you get a world filled with cop-outs like this:
and this:
both of which are tolerable enough to just look at, but awful when you go to try to get the computer to answer even perfectly sensible questions like what albums Boris has done.
And then, when you want to add a little more detail, you really ought to do this:
but instead you'll probably try to get away with:
or
because if you have the latter, you can do:
which is easy to understand, although it won't work because you forgot you don't have a field called plain "Artist" anymore. Whereas with the five-table form you have to do something like:
and that's probably still wrong.
I submit that this is all a colossal mess, and it's a wonder the database-driven software we get driven out of it isn't even more woeful than it is. A better paradigm would embrace the opposites of these two flaws:
- the modeling of a relationship should be the same whether the relationship is one-to-one, one-to-many, many-to-one or many-to-many, and should be the same no matter how many "many"s you have
- absence and presence should be equally easy to assess
And I think a useful rough metric for such a new system is that it allows you to model your data without ever needing to say FALSE.
And I think this, I think, because I think this is how we think and talk about relationships between things when nobody is getting in our way:
Black One was recorded by Sunn O))), and was released in CD, LP and Download formats. Altar was recorded by Sunn O))) and Boris, and was released in CD and LP formats. All of these albums were released on CD and LP. Altar is the only one that wasn't released in Download format.
Take these out of sentence form and make the data-types explicit and everything is still perfectly straightforward:
or, looking at the same data from a different perspective:
and then the old questions are also easier:
Subliminating the one/many distinction allows us to talk about albums with 3 or 8 or 0 artists just as readily as we talk about albums with 1 or 2. Fixing the query language to handle absence lets us cope with the introduction of a new edition into our dataset, say, without having to add a new field for it and/or go through asserting that everything we already knew about doesn't have that format. I suspect pretty much every database field headed X and filled with TRUEs and FALSEs could at least be more usefully modeled in my new world as a relationship called "Characteristics" that either does or doesn't include X. And usually, as with formats, there's actually a meaningful label that contains information itself. The point is that things have relationships to their characteristics, or their formats. They don't, except in a very existential sense, have a relationship to falseness. FALSE is a machine thing.
The relationally astute will note that the five-table version is the fully-normalized representation that would support an SQL-based implementation of my new form as a display abstraction, and presumably my pseudo-query-code could be preprocessed into the messy SQL so I'd never see it. But that's my point: SQL is too low an abstraction, which makes it too hard to do things naturally, which makes it too hard to pass along natural behavior to the victims of the system. I'm mainly arguing for a better abstraction. (Although I also expect it's still always going to be bad that there's no better underlying way to represent lists, so I'm also probably arguing for the better abstraction to not merely be layered on top of the bad one.)
The discographically astute will note that Altar is available on iTunes, and that I have vastly undermodeled the rich and complicated space of Sunn O))) release formats. But if I'd done the examples in real detail, I'd still be typing out the bad versions...
- one-to-many relationships are significantly harder to handle than one-to-one relationships (and, inversely, many-to-one relationships easiest to handle by defactoring)
- absence is significantly harder to handle than presence
It's easy to add an Artist column to an Album table, for example, so we get:
Album | Artist |
Black One | Sunn O))) |
But if you want to model the fact that collaboration albums can have multiple artists, you need an Albums table, an Artists table, a join table, and a bunch of IDs:
Albums
| Artists
| Authorship
|
This is already a pain in the ass, and thus you get a world filled with cop-outs like this:
Album | Artist |
Black One | Sunn O))) |
Altar | Sunn O))) / Boris |
and this:
Album | Artist 1 | Artist 2 | Artist 3 |
Black One | Sunn O))) | FALSE | FALSE |
Altar | Sunn O))) | Boris | FALSE |
both of which are tolerable enough to just look at, but awful when you go to try to get the computer to answer even perfectly sensible questions like what albums Boris has done.
And then, when you want to add a little more detail, you really ought to do this:
Albums
| Artists
| Formats
| Authorship
| Formatship
|
but instead you'll probably try to get away with:
Album | Artist | FormatCode |
Black One | Sunn O))) | CD/LP/D |
Altar | Sunn O))) & Boris | CD/LP |
or
Album | Artist 1 | Artist 2 | Artist 3 | CD | LP | D |
Black One | Sunn O))) | FALSE | FALSE | TRUE | TRUE | TRUE |
Altar | Sunn O))) | Boris | FALSE | TRUE | TRUE | FALSE |
because if you have the latter, you can do:
SELECT Album, Artist
FROM Albums
WHERE D=FALSE
which is easy to understand, although it won't work because you forgot you don't have a field called plain "Artist" anymore. Whereas with the five-table form you have to do something like:
SELECT Albums.Album, Artists.Artist
FROM Albums, Artists, Authorship
WHERE Albums.ID=Authorship.AlbumID
AND Artists.ID=Authorship.ArtistID
AND NOT EXISTS (
SELECT *
FROM Formats, Formatship
WHERE Albums.ID=Formatship.AlbumID
AND Formats.ID=Formatship.FormatID
AND Formats.Format='Download'
)
and that's probably still wrong.
I submit that this is all a colossal mess, and it's a wonder the database-driven software we get driven out of it isn't even more woeful than it is. A better paradigm would embrace the opposites of these two flaws:
- the modeling of a relationship should be the same whether the relationship is one-to-one, one-to-many, many-to-one or many-to-many, and should be the same no matter how many "many"s you have
- absence and presence should be equally easy to assess
And I think a useful rough metric for such a new system is that it allows you to model your data without ever needing to say FALSE.
And I think this, I think, because I think this is how we think and talk about relationships between things when nobody is getting in our way:
Black One was recorded by Sunn O))), and was released in CD, LP and Download formats. Altar was recorded by Sunn O))) and Boris, and was released in CD and LP formats. All of these albums were released on CD and LP. Altar is the only one that wasn't released in Download format.
Take these out of sentence form and make the data-types explicit and everything is still perfectly straightforward:
Album: [Black One]
Artists: [Sunn O)))]
Formats: [CD] [LP] [Download]
Album: [Altar]
Artists: [Sunn O)))] [Boris]
Formats: [CD] [LP]
or, looking at the same data from a different perspective:
Format: [CD]
Albums: [Black One] [Altar]
Format: [LP]
Albums: [Black One] [Altar]
Format: [Download]
Albums: [Black One]
and then the old questions are also easier:
FIND Formats WHOSE Albums INCLUDE 'Black One' and 'Altar'
FIND Albums WHOSE Formats DON'T INCLUDE 'Download'
Subliminating the one/many distinction allows us to talk about albums with 3 or 8 or 0 artists just as readily as we talk about albums with 1 or 2. Fixing the query language to handle absence lets us cope with the introduction of a new edition into our dataset, say, without having to add a new field for it and/or go through asserting that everything we already knew about doesn't have that format. I suspect pretty much every database field headed X and filled with TRUEs and FALSEs could at least be more usefully modeled in my new world as a relationship called "Characteristics" that either does or doesn't include X. And usually, as with formats, there's actually a meaningful label that contains information itself. The point is that things have relationships to their characteristics, or their formats. They don't, except in a very existential sense, have a relationship to falseness. FALSE is a machine thing.
The relationally astute will note that the five-table version is the fully-normalized representation that would support an SQL-based implementation of my new form as a display abstraction, and presumably my pseudo-query-code could be preprocessed into the messy SQL so I'd never see it. But that's my point: SQL is too low an abstraction, which makes it too hard to do things naturally, which makes it too hard to pass along natural behavior to the victims of the system. I'm mainly arguing for a better abstraction. (Although I also expect it's still always going to be bad that there's no better underlying way to represent lists, so I'm also probably arguing for the better abstraction to not merely be layered on top of the bad one.)
The discographically astute will note that Altar is available on iTunes, and that I have vastly undermodeled the rich and complicated space of Sunn O))) release formats. But if I'd done the examples in real detail, I'd still be typing out the bad versions...