¶ Whole Data (live version 2) and Thread · 16 October 2008 essay/tech
I gave another version of my Whole Data blog post as part of a panel called "Death of the Relational Database?" at the Web 3.0 conference today.
My answer to the question in the panel-title is that the relational database was always mostly a coping mechanism for a time of relative scarcity, when our data volume and needs exceeded our storage and processing capacities. This era is not entirely past us, but with more memory, faster data-loading, faster processing and better programming tools, there are increasingly many datasets of larger and larger size for which we can now support qualitatively richer abstractions. Where we used to have to break data apart, and thus were constrained to interact with it in fragmented and broken ways, we can now increasingly afford to keep it whole.
And here, then, to try to explain why that matters, is my current four-point reduction of the Whole Data manifesto:
1. What logic allows, storage may not constrain.
The forms of the logical data-model and its inquiries are by definition independent of any extra-logical storage mechanics. That is, a proper data system exists for the express purpose of sustaining the illusion that the logical data-model is real. This must also be true over time: logically-sensible changes in the logical data model cannot be constrained by storage mechanics.
2. All relationships are lists.
Not only must multiple-value relationships and no-value relationships be exactly as easily expressed, manipulated and inquired about as single-value relationships, but every fundamental data operation should thrive on multiplicity and sequence.
3. There is no up or down, only onwards from here.
Humans impose hierarchy and directionality on data, but different humans recognize different hierarchies in the same data at the same time. The system itself must be agnostic. No fact is inherently the property of another, they are all independent and structurally equal, and although pairs of relationships can be inverses of each other, there is no absolute up or down, or forwards or backwards. A database of albums, artists and years, for example, is exactly as much a database of years as it is of artists as it is of albums, not by the generous will of its schema-creator, but by the intrinsic nature of data. Most significantly, you must be able to start at any node and see everything else in this dataset to which, and from which, it is connected. In fact, turn this around and you get the essential practical definition of a dataset: it is a collection of data in which all internal connections are expressed.
4. Human inquiry follows paths; machine inquiry follows all the paths at once.
The reason to put data into computers is so that computers can answer questions that are meaningful to humans, but faster for computers. A data model and query language exist to communicate human needs to computers, not to communicate machine preferences to humans, and the responsiveness of a data system is its qualitative ability to answer questions with human motivations, not its quantitative throughput.
At the end of this talk I also did the first (very brief) public demo of Thread, the path-based query language I've written as part of the data-modeling (among other things) system I've been working on at ITA Software. I'm looking forward (to put it mildly) to being able to talk much more about this, but for now I just whirled through a fast series of mostly-unexplicated queries that at least gestured in the direction of the points in my manifesto:
Album:Year=2000
Album:Artist=Nightwish
Artist:Album~Dark
Label:Album~Dark
Album:~Dark.Label
Artist:(.Album.Label:=Spinefarm)
Label:=Spinefarm.Album.Artist
Artist:(.Album:#5.Year:>2000)
Artist:!(.Album:#5)
I also mentioned in the talk, but didn't actually show, the hypothetical query "Who are all the living movie directors who ever directed Cary Grant?":
Actor:=Cary Grant.Film.Director:!(.Date of Death)
Go ahead and try to answer that in IMDB without a query language. For that matter, try to answer it in Freebase with a query language. (Or, even worse, try to answer it in Freebase yesterday, before I fixed a bunch of dates of death by hand...)
There's also a short article up on semanticweb.com at the moment, which if nothing else gives a transcribed sense of how I explain these ideas differently in a phone interview than I do in writing.
And there's space for discussion here.
My answer to the question in the panel-title is that the relational database was always mostly a coping mechanism for a time of relative scarcity, when our data volume and needs exceeded our storage and processing capacities. This era is not entirely past us, but with more memory, faster data-loading, faster processing and better programming tools, there are increasingly many datasets of larger and larger size for which we can now support qualitatively richer abstractions. Where we used to have to break data apart, and thus were constrained to interact with it in fragmented and broken ways, we can now increasingly afford to keep it whole.
And here, then, to try to explain why that matters, is my current four-point reduction of the Whole Data manifesto:
1. What logic allows, storage may not constrain.
The forms of the logical data-model and its inquiries are by definition independent of any extra-logical storage mechanics. That is, a proper data system exists for the express purpose of sustaining the illusion that the logical data-model is real. This must also be true over time: logically-sensible changes in the logical data model cannot be constrained by storage mechanics.
2. All relationships are lists.
Not only must multiple-value relationships and no-value relationships be exactly as easily expressed, manipulated and inquired about as single-value relationships, but every fundamental data operation should thrive on multiplicity and sequence.
3. There is no up or down, only onwards from here.
Humans impose hierarchy and directionality on data, but different humans recognize different hierarchies in the same data at the same time. The system itself must be agnostic. No fact is inherently the property of another, they are all independent and structurally equal, and although pairs of relationships can be inverses of each other, there is no absolute up or down, or forwards or backwards. A database of albums, artists and years, for example, is exactly as much a database of years as it is of artists as it is of albums, not by the generous will of its schema-creator, but by the intrinsic nature of data. Most significantly, you must be able to start at any node and see everything else in this dataset to which, and from which, it is connected. In fact, turn this around and you get the essential practical definition of a dataset: it is a collection of data in which all internal connections are expressed.
4. Human inquiry follows paths; machine inquiry follows all the paths at once.
The reason to put data into computers is so that computers can answer questions that are meaningful to humans, but faster for computers. A data model and query language exist to communicate human needs to computers, not to communicate machine preferences to humans, and the responsiveness of a data system is its qualitative ability to answer questions with human motivations, not its quantitative throughput.
At the end of this talk I also did the first (very brief) public demo of Thread, the path-based query language I've written as part of the data-modeling (among other things) system I've been working on at ITA Software. I'm looking forward (to put it mildly) to being able to talk much more about this, but for now I just whirled through a fast series of mostly-unexplicated queries that at least gestured in the direction of the points in my manifesto:
Album:Year=2000
albums from the year 2000; a one-to-one relationship in the old world, but in the new world the year 2000 is a real thing, too, making this many-to-one
Album:Artist=Nightwish
albums by Nightwish; a more familiar many-to-one relationship
Artist:Album~Dark
Artists who did an album with "Dark" in the name; a one-to-many relationship with the same syntax as above
Label:Album~Dark
Labels that put out albums with "Dark" in the name; same syntax again, but following a different path to Albums
Album:~Dark.Label
same question again, but following the path from Albums, instead of to them
Artist:(.Album.Label:=Spinefarm)
Artists who've had albums on Spinefarm; same pattern as above, but filtering on a compound relationship
Label:=Spinefarm.Album.Artist
or go the other way
Artist:(.Album:#5.Year:>2000)
Artists whose fifth album came out after 2000; multiples and missing values (one of the bands in the sample data had only four albums)
Artist:!(.Album:#5)
artists who don't have a fifth album; filtering on absence
I also mentioned in the talk, but didn't actually show, the hypothetical query "Who are all the living movie directors who ever directed Cary Grant?":
Actor:=Cary Grant.Film.Director:!(.Date of Death)
Go ahead and try to answer that in IMDB without a query language. For that matter, try to answer it in Freebase with a query language. (Or, even worse, try to answer it in Freebase yesterday, before I fixed a bunch of dates of death by hand...)
There's also a short article up on semanticweb.com at the moment, which if nothing else gives a transcribed sense of how I explain these ideas differently in a phone interview than I do in writing.
And there's space for discussion here.