23 February 2010 to 27 September 2009
There's a startup called Etacts whose software can import your email (headers), and give you a view by person instead of by message. This is pretty obviously something you'd like to be able to see, if you have any significant volume of correspondence or correspondents. And it's pretty obviously a large leap of faith to give some random startup the ability to log into your email account.
But, even more, this is a perfect example of something you ought to be able to do, at least basically, with your data without needing any extra application. Etacts provides some unique contact-management features, but the fundamental function of rotating data to some other axis is, I contend, rightfully a property of the data. You already own the view of your email by recipient, because it is immanent in your email. But your email service probably doesn't let you see it. I would like it to be the responsibility of a data-holding service to give you full logical access to that data.
This is a huge part of why my work project matters to me. It is my attempt to design a universal data-modeling paradigm, and associated query-language, that could be a candidate standard for what it means to provide "full logical access" to data.
But, even more, this is a perfect example of something you ought to be able to do, at least basically, with your data without needing any extra application. Etacts provides some unique contact-management features, but the fundamental function of rotating data to some other axis is, I contend, rightfully a property of the data. You already own the view of your email by recipient, because it is immanent in your email. But your email service probably doesn't let you see it. I would like it to be the responsibility of a data-holding service to give you full logical access to that data.
This is a huge part of why my work project matters to me. It is my attempt to design a universal data-modeling paradigm, and associated query-language, that could be a candidate standard for what it means to provide "full logical access" to data.
¶ Lieu, Light and Favor · 10 February 2010
Usage Notes
"In lieu of" means "instead of", with the general implication that the object of the phrase is absent or unavailable. So "I ordered onion rings in lieu of fries" if they were out of fries and you were forced to choose something else, but "I ordered onion rings instead of fries" is better if it's just your choice. "In place of" works for both, too.
"In light of" means "as a result of", with the sense of having changed something because you observed (thus the "light", by which you see) something else. So "In light of the overwhelming shift in demand from fries to onion rings, I recommend we reduce our potato order." Or just "Let's get more onions instead of so many potatoes. Everybody hates potatoes now, apparently."
"In favor of" means "in place of", like "instead of" but in the opposite order, with a sense of some kind of trade-off. "We reduced our potato order in favor of more onions to satisfy rampant onion-ring demand." Or "We bought more onions instead of potatoes, because everybody is ordering onion rings for some weird reason now. Did somebody on Chowhound post something good about our onion rings? Or bad about our fries?!"
In light of frequent misuse, in lieu of specific needs I recommend eschewing all three in favor of saying what you know you mean.
"In lieu of" means "instead of", with the general implication that the object of the phrase is absent or unavailable. So "I ordered onion rings in lieu of fries" if they were out of fries and you were forced to choose something else, but "I ordered onion rings instead of fries" is better if it's just your choice. "In place of" works for both, too.
"In light of" means "as a result of", with the sense of having changed something because you observed (thus the "light", by which you see) something else. So "In light of the overwhelming shift in demand from fries to onion rings, I recommend we reduce our potato order." Or just "Let's get more onions instead of so many potatoes. Everybody hates potatoes now, apparently."
"In favor of" means "in place of", like "instead of" but in the opposite order, with a sense of some kind of trade-off. "We reduced our potato order in favor of more onions to satisfy rampant onion-ring demand." Or "We bought more onions instead of potatoes, because everybody is ordering onion rings for some weird reason now. Did somebody on Chowhound post something good about our onion rings? Or bad about our fries?!"
In light of frequent misuse, in lieu of specific needs I recommend eschewing all three in favor of saying what you know you mean.
Now that Needle, my work project, is finally no longer secret, I'm starting my slow seditious campaign to subvert the entire Semantic Web establishment. "Entire" is maybe a big word for a small world that not very many people care about, but part of the reason I care is that I think more people would care about the problems this community is addressing if the nature of the problems weren't so obscured by the prevailing ostensible solutions.
I'll be taking this argument on the road, albeit only a couple blocks down it, for a short talk at the Cambridge Semantic Web Meetup Group, next Tuesday night (9 Feb, 6pm) at MIT.
I'll be taking this argument on the road, albeit only a couple blocks down it, for a short talk at the Cambridge Semantic Web Meetup Group, next Tuesday night (9 Feb, 6pm) at MIT.
DERI semantic-web researcher Alexandre Passant just announced a semantic-web-based music-recommendation engine called dbrec. It runs on dbpedia data, and computes "Linked Data Semantic Distance" between bands to find likely suggestions. This is an intriguing premise, and probably a worthy experiment.
The site isn't labeled "intriguing experiment", though, it's labeled "intelligent music recommendations". Here are its top "intelligent" recommendations for, just to pick a random example near the beginning of the alphabet, Annihilator:
Jeff Waters
Primal Fear
D.O.A.
Extreme
Randy Black
Megadeth
Bif Naked
"Intelligent" is not really the word for this.
On the one hand, the quality of the recommendations is not mostly Passant's fault. The underlying data isn't that great, and you can see how not-great it is in Passant's generated explanations, like this one for how Jeff Waters and Annihilator relate:
This is an incredibly obtuse way of saying, as the human-readable Wikipedia article about Waters puts it in the first sentence: "Jeff Waters is the guitarist and mastermind of the thrash metal band Annihilator". Passant's data doesn't quite record this fact, so he's left to try to make sense of the difference between "associated musical artist", "associated act" and "associated band" and "reference".
Unsurprisingly, not much interesting sense results. Everything in this example is connected primarily through personnel overlap. Drummer Randy Black has played in both Primal Fear and Annihilator, and on one Bif Naked album. D.O.A. founder Randall Archibald sang on two Annihilator albums. Extreme are on the list because drummer Mike Mangini played briefly in both bands. Bif and Annihilator share the hometown "Canada".
These connections aren't irrelevant, exactly. If you were trying to get the phone number of Annihilator's booking agent, they might be worth scanning through in case you spot somebody you went to high school with.
As musical recommendations, though, they suck. As "intelligent" musical recommendations they're idiotic. Annihilator is a thrash-metal band, Extreme were metal-derived MOR pop, Bif Naked is a punk singer. Compare the list, with no claim of "intelligence", for Annihilator on empath:
A person could probably do better than this, too, especially if they're allowed extra adjectives and a lower granularity than artists ("like Kill 'Em All-era Metallica", or "started off like early Slayer, but with more emphasis on technique"; and I don't even know Annihilator very well), but this list is at least not inane.
And yet, Passant's work is almost certainly more technically sophisticated than mine. I used one genre, one data-source and one connection-metric, and produced a deliberately simple web-site with almost no ancillary information. Passant had to confront the sprawl of the Linked Open Data "cloud", figure out non-obvious weightings for a bunch of different connection paths, and display a lot more information than I deal with, in both breadth and depth.
And yet, and yet, and yet: The recommendations are bad. Or, more accurately, the connections are what they are, but calling them recommendations is bad. Calling them "intelligent" is worse, and presenting the combination of "intelligent", "recommendation" and "linked data" to the general public is deadly. If "Linked Data" and "Semantic Web" mean ways for machines to tell me that if I like Annihilator I should listen to Extreme, then nobody needs them. If Linked Data, the movement, can't tell the difference between intelligent and idiotic, it's not to be trusted on anything.
[2013 Postscript]
The site isn't labeled "intriguing experiment", though, it's labeled "intelligent music recommendations". Here are its top "intelligent" recommendations for, just to pick a random example near the beginning of the alphabet, Annihilator:
Jeff Waters
Primal Fear
D.O.A.
Extreme
Randy Black
Megadeth
Bif Naked
"Intelligent" is not really the word for this.
On the one hand, the quality of the recommendations is not mostly Passant's fault. The underlying data isn't that great, and you can see how not-great it is in Passant's generated explanations, like this one for how Jeff Waters and Annihilator relate:
Annihilator (band) is 'associated musical artist' of Jeff Waters (7 artists sharing it)
Annihilator (band) is 'associated acts' of Jeff Waters (7 artists sharing it)
Annihilator (band) is 'associated band' of Jeff Waters (7 artists sharing it)
Jeff Waters is 'current members' of Annihilator (band) (2 artists sharing it)
Annihilator (band) and Jeff Waters share the same value for 'genre'
- Thrash metal (529 artists sharing it)
- Groove metal (101 artists sharing it)
- Speed metal (170 artists sharing it)
- Heavy Metal Music (1534 artists sharing it)
Annihilator (band) and Jeff Waters share the same value for 'reference' (1 artists sharing it)
Annihilator (band) is 'associated acts' of Jeff Waters (7 artists sharing it)
Annihilator (band) is 'associated band' of Jeff Waters (7 artists sharing it)
Jeff Waters is 'current members' of Annihilator (band) (2 artists sharing it)
Annihilator (band) and Jeff Waters share the same value for 'genre'
- Thrash metal (529 artists sharing it)
- Groove metal (101 artists sharing it)
- Speed metal (170 artists sharing it)
- Heavy Metal Music (1534 artists sharing it)
Annihilator (band) and Jeff Waters share the same value for 'reference' (1 artists sharing it)
This is an incredibly obtuse way of saying, as the human-readable Wikipedia article about Waters puts it in the first sentence: "Jeff Waters is the guitarist and mastermind of the thrash metal band Annihilator". Passant's data doesn't quite record this fact, so he's left to try to make sense of the difference between "associated musical artist", "associated act" and "associated band" and "reference".
Unsurprisingly, not much interesting sense results. Everything in this example is connected primarily through personnel overlap. Drummer Randy Black has played in both Primal Fear and Annihilator, and on one Bif Naked album. D.O.A. founder Randall Archibald sang on two Annihilator albums. Extreme are on the list because drummer Mike Mangini played briefly in both bands. Bif and Annihilator share the hometown "Canada".
These connections aren't irrelevant, exactly. If you were trying to get the phone number of Annihilator's booking agent, they might be worth scanning through in case you spot somebody you went to high school with.
As musical recommendations, though, they suck. As "intelligent" musical recommendations they're idiotic. Annihilator is a thrash-metal band, Extreme were metal-derived MOR pop, Bif Naked is a punk singer. Compare the list, with no claim of "intelligence", for Annihilator on empath:
0.228 Anthrax
0.224 Dio
0.206 Sodom
0.182 Saxon
0.176 Black Label Society
0.175 Onslaught (Gbr)
0.170 Grave Digger
0.224 Dio
0.206 Sodom
0.182 Saxon
0.176 Black Label Society
0.175 Onslaught (Gbr)
0.170 Grave Digger
A person could probably do better than this, too, especially if they're allowed extra adjectives and a lower granularity than artists ("like Kill 'Em All-era Metallica", or "started off like early Slayer, but with more emphasis on technique"; and I don't even know Annihilator very well), but this list is at least not inane.
And yet, Passant's work is almost certainly more technically sophisticated than mine. I used one genre, one data-source and one connection-metric, and produced a deliberately simple web-site with almost no ancillary information. Passant had to confront the sprawl of the Linked Open Data "cloud", figure out non-obvious weightings for a bunch of different connection paths, and display a lot more information than I deal with, in both breadth and depth.
And yet, and yet, and yet: The recommendations are bad. Or, more accurately, the connections are what they are, but calling them recommendations is bad. Calling them "intelligent" is worse, and presenting the combination of "intelligent", "recommendation" and "linked data" to the general public is deadly. If "Linked Data" and "Semantic Web" mean ways for machines to tell me that if I like Annihilator I should listen to Extreme, then nobody needs them. If Linked Data, the movement, can't tell the difference between intelligent and idiotic, it's not to be trusted on anything.
[2013 Postscript]
The short version:
Metal
1. Madder Mortem: Eight Ways
2. Secrets of the Moon: Privilegivm
3. Thy Catafalque: Róka Hasa Rádió
4. Antigua y Barbuda: Try Future
5. Funeral Mist: Maranatha
6. Absu: Absu
7. Wardruna: Runaljod - Gap Var Ginnunga
8. Lifelover: Dekadens
9. Amorphis: Skyforger
10. Cantata Sangui: On Rituals and Correspondence in Constructed Realities
Not Metal
1. Manic Street Preachers: Journal for Plague Lovers demos + album + remixes
2. Tori Amos: Midwinter Graces and Abnormally Attracted to Sin
3. Idlewild: Post Electric Blues
4. It Bites: The Tall Ships
5. Bat for Lashes: Two Suns
6. Wheat: White Ink Black Ink
7. Stars of Track and Field: A Time for Lions
8. Tegan & Sara: Sainthood
9. Metric: Fantasies
10. Maxïmo Park: Quicken the Heart
The long version:
TWAS 511: Lay Down (With) Your Armor
The listen-for-yourself versions:
- furia2009metal.zip (21 songs, 2 hours, 110MB)
- furia2009nonmetal.zip (31 songs, 2 hours, 111MB)
Metal
1. Madder Mortem: Eight Ways
2. Secrets of the Moon: Privilegivm
3. Thy Catafalque: Róka Hasa Rádió
4. Antigua y Barbuda: Try Future
5. Funeral Mist: Maranatha
6. Absu: Absu
7. Wardruna: Runaljod - Gap Var Ginnunga
8. Lifelover: Dekadens
9. Amorphis: Skyforger
10. Cantata Sangui: On Rituals and Correspondence in Constructed Realities
Not Metal
1. Manic Street Preachers: Journal for Plague Lovers demos + album + remixes
2. Tori Amos: Midwinter Graces and Abnormally Attracted to Sin
3. Idlewild: Post Electric Blues
4. It Bites: The Tall Ships
5. Bat for Lashes: Two Suns
6. Wheat: White Ink Black Ink
7. Stars of Track and Field: A Time for Lions
8. Tegan & Sara: Sainthood
9. Metric: Fantasies
10. Maxïmo Park: Quicken the Heart
The long version:
TWAS 511: Lay Down (With) Your Armor
The listen-for-yourself versions:
- furia2009metal.zip (21 songs, 2 hours, 110MB)
- furia2009nonmetal.zip (31 songs, 2 hours, 111MB)
After deconstructing the annual Village Voice music poll for many years, this year I actually helped construct it, doing the data-correction and tabulation myself in Needle, the new database system I work on at ITA Software. In conjunction with this, the system itself is finally making its public debut! We are still in private beta for people who want to build data-sets using our tools, but the Voice has allowed us to share the underlying data from the poll as a sample of what data-sets look like in Needle.
I will have, probably, far too much to say about this over time, now that I can actually show people a partial glimpse of what I've been doing for a living for 3+ years. For now, here are the links:
Pazz & Jop 2009 (official results)
Needle - Pazz & Jop 2009 (system intro and data explorer)
All-Idols 2009 (centricity, similarity, kvltosis and other assorted stats)
I will have, probably, far too much to say about this over time, now that I can actually show people a partial glimpse of what I've been doing for a living for 3+ years. For now, here are the links:
Pazz & Jop 2009 (official results)
Needle - Pazz & Jop 2009 (system intro and data explorer)
All-Idols 2009 (centricity, similarity, kvltosis and other assorted stats)
¶ 4 November 2009
Fear is one thing, but voting for hate when love would have cost you nothing is pretty much the quintessence of cowardice. Love will win, and after that, telling the story of standing up for divisiveness and intolerance will be like being proud of having hand-lettered "No Coloreds" signs for water-fountains.
And although the individual people will be forgiven, the institutions should be forfeit. In particular, any church that campaigned against same-sex marriage has violated the social contract that permits religion and rational society to coexist, and should be seized by eminent domain and fumigated for ignorance.
And although the individual people will be forgiven, the institutions should be forfeit. In particular, any church that campaigned against same-sex marriage has violated the social contract that permits religion and rational society to coexist, and should be seized by eminent domain and fumigated for ignorance.
In data-aggregation we are constantly trying to deduplicate references. One dataset says "Olecito", one says "Olecito -- Cambridge", a third says "Olecito Taqueria", and a fourth says "Cielito Taqueria" in the same city as the others, and we'd really like to know whether we have four restaurants, or one, or they're all imaginary or closed or what.
Aligning the variant names is a problem in itself, but it's not the hard one. The hard one is when the concepts themselves don't quite line up. One dataset may be a list of employers, for example, so one "Olecito" is actually a corporate entity with various business properties, including a business address for their accountant's office, and if we show up there wanting a steak quesadilla we will probably (although I don't want to underestimate the guy) be disappointed.
In Linked Data practice (and I'm not currently arguing that you need to know what that is if you don't already), the most common tool for asserting equivalence is a relationship called "owl:sameAs". Out of context this seems like some kind of Star Wars joke by way of Harry Potter: those were not the droids you were looking for, but these are the owls you want. But the "owl:" part is just a bit of pointlessly off-putting data-modeling pedantry in which every single thing you say has to be individually attributed, like saying "These scrambled (in the Epicurious sense) eggs (in the Wikipedia sense) are (in the OED sense) cold (in the Eastern West Virginia State College Department of Thermopedantics and Breakfast sense)."
But never mind, the point of this goofy label is just to say that two different names are actually names for the same thing. Or concept, or even restaurant. The way the function works, officially, it is an absolute statement. It is inherently transitive and symmetric, so if we say that "Olecito", the corporation, is the same owls as "Olecito Taqueria", the excellent take-out restaurant, and "Olecito Taqueria" is a name for the same thing somebody once mistakenly typed as "Cielito Taqueria", then we are saying that anything said about any of those is true of the others. And there we are in the CPA's office-park again, annoying him by demanding Jarritos and a torta.
Practitioner arguments about this issue are obscure even by my standards. (You can see one here, but you've been warned.) Towards the end of that scattered argument, though, it occurred to me what I think the official owls need.
Sometimes we really are trying to say that two different names are wholly interchangeable, particularly when we're cleaning up a single dataset. So having a way to say that is necessary and good.
But across datasets what we usually mean is, in English, something more like "For our current purposes, the thing called X in somebody else's dataset can be taken to provide some missing information about Y in our dataset." This is different from the same-owls thing in two essential ways: it's contextual, not absolute; and even more importantly, it's one-way. We are not saying, or even implying, that our information about Y ought to be inserted into anybody's database as information about X or Z or anything. We are making no claim about whether X and Y are the same existentially, or for any other purpose than this.
The relevant web standards are missing this concept. Maybe oddly, maybe not: RDF and OWL arose out of taxonomy, not data-cleanup. So they have a way to say that one whole class of things are derived from another class of things, and that one relationship is a variation on another relationship; but not that one individual data-concept is a contextual elaboration on another.
Adding it, though, would be trivial, at least logically. It could be called "owl:represents", as in "My owl is elsewhere, so I'll send this note with yours." And in the other direction, if you are the recipient of this representation, the relationship would be "owl:isRepresentedBy". The assertion is neither transitive nor symmetric. What I say about your data is completely separate from what you say about mine, or what you say about anybody else's. I have no business saying whether my data matches yours for your purposes, and no desire to have my statements about my purposes conflated with anything broader. I'm only trying to say what I'm saying. More precisely, I'm trying to say only what I'm saying. We each keep our owls, and you know that I used yours to send a message, but that doesn't make my owl your owl, or my message your message, or anything inane like that.
Surely the owls, at least, would appreciate this.
Aligning the variant names is a problem in itself, but it's not the hard one. The hard one is when the concepts themselves don't quite line up. One dataset may be a list of employers, for example, so one "Olecito" is actually a corporate entity with various business properties, including a business address for their accountant's office, and if we show up there wanting a steak quesadilla we will probably (although I don't want to underestimate the guy) be disappointed.
In Linked Data practice (and I'm not currently arguing that you need to know what that is if you don't already), the most common tool for asserting equivalence is a relationship called "owl:sameAs". Out of context this seems like some kind of Star Wars joke by way of Harry Potter: those were not the droids you were looking for, but these are the owls you want. But the "owl:" part is just a bit of pointlessly off-putting data-modeling pedantry in which every single thing you say has to be individually attributed, like saying "These scrambled (in the Epicurious sense) eggs (in the Wikipedia sense) are (in the OED sense) cold (in the Eastern West Virginia State College Department of Thermopedantics and Breakfast sense)."
But never mind, the point of this goofy label is just to say that two different names are actually names for the same thing. Or concept, or even restaurant. The way the function works, officially, it is an absolute statement. It is inherently transitive and symmetric, so if we say that "Olecito", the corporation, is the same owls as "Olecito Taqueria", the excellent take-out restaurant, and "Olecito Taqueria" is a name for the same thing somebody once mistakenly typed as "Cielito Taqueria", then we are saying that anything said about any of those is true of the others. And there we are in the CPA's office-park again, annoying him by demanding Jarritos and a torta.
Practitioner arguments about this issue are obscure even by my standards. (You can see one here, but you've been warned.) Towards the end of that scattered argument, though, it occurred to me what I think the official owls need.
Sometimes we really are trying to say that two different names are wholly interchangeable, particularly when we're cleaning up a single dataset. So having a way to say that is necessary and good.
But across datasets what we usually mean is, in English, something more like "For our current purposes, the thing called X in somebody else's dataset can be taken to provide some missing information about Y in our dataset." This is different from the same-owls thing in two essential ways: it's contextual, not absolute; and even more importantly, it's one-way. We are not saying, or even implying, that our information about Y ought to be inserted into anybody's database as information about X or Z or anything. We are making no claim about whether X and Y are the same existentially, or for any other purpose than this.
The relevant web standards are missing this concept. Maybe oddly, maybe not: RDF and OWL arose out of taxonomy, not data-cleanup. So they have a way to say that one whole class of things are derived from another class of things, and that one relationship is a variation on another relationship; but not that one individual data-concept is a contextual elaboration on another.
Adding it, though, would be trivial, at least logically. It could be called "owl:represents", as in "My owl is elsewhere, so I'll send this note with yours." And in the other direction, if you are the recipient of this representation, the relationship would be "owl:isRepresentedBy". The assertion is neither transitive nor symmetric. What I say about your data is completely separate from what you say about mine, or what you say about anybody else's. I have no business saying whether my data matches yours for your purposes, and no desire to have my statements about my purposes conflated with anything broader. I'm only trying to say what I'm saying. More precisely, I'm trying to say only what I'm saying. We each keep our owls, and you know that I used yours to send a message, but that doesn't make my owl your owl, or my message your message, or anything inane like that.
Surely the owls, at least, would appreciate this.
¶ Inside a Dataset, Relational Density Is Data's Best Friend. Outside a Dataset, It's Too Dark to Read · 29 September 2009 essay/tech
Stefano Mazzocchi (of Freebase) and David Karger (of MIT) have been holding a slow but interesting conversation about data reconciliation. It's been phrased as a sort of debate between two arbitrarily polarized approaches to the problem of cleaning up data so that you don't have multiple variant references to the same real data-point: either you try to do this cleanup "in the data", so that it's done and you don't have to worry about it any more; or you leave the data alone and figure you'll incorporate the cleanup into your query process when you actually start asking specific questions.
But I think this is less of a debate than Stefano and (particularly) David are making it seem. Or, rather, that the real polarization is along a slightly different axis. The biggest difference between Stefano's world and David's happens before you even start to worry about when you're going to start worrying about data cleanup: Freebase is attempting to build a single huge structured-data-representation of (eventually) all knowledge; David's work is largely about building lots of little structured-data-representations of individual (relatively) small blobs of knowledge.
Within a dataset, though, Stefano and David (and I) are in enthusiastic agreement: internal consistency is what makes data meaningful. If your data doesn't know whether "Beatles" and "The Beatles" are one band or two different ones, your answers to any question that touches either one are liable to be so pathetically wrong that people will very quickly stop asking you things. In the classic progression from Data to Information to Knowledge to Wisdom, this is what that first step means: Data is isolated, Information is consolidated. It would be inane to be against this. (Unless, I guess, you were also against Knowledge and Wisdom.)
It's on the definition of "within", though, that most of the interesting issues under the heading of "the Semantic Web" wobble. David says "I argued in favor of letting individuals make their own [datasets] ... and worry about merging them with other peoples data later." He really does mean individuals, and the MIT-developed tool he has in mind (Exhibit) is designed (and only suited) for fairly small datasets. Freebase is officially a database of everything, but of course within this everything are lots of subsets, and Stefano is really talking more about cleaning up these subsets than the whole universe. Reconciling all the artist references to "Beatles" and "The Beatles" from albums and songs is a straightforward practical task both beneficial and probably necessary for asking questions of a song/album/artist dataset, whether it's embedded in a larger one or not. Reconciling the "'The Beatles" who are the artist of the song "Day Tripper" with "The Beatles" who are depicted in cartoon form on a particular vintage lunchbox for sale on eBay, on the other hand, is less conceptually straightforward, and of more obscure practical import.
The thing I'm working on at ITA falls in between Exhibit and Freebase on this dataset axis, both in capacity and design. We handle datasets much bigger than you could put in Exhibit, and allow you to explore and analyze them in ways that Exhibit cannot; but we separate datasets, partly for scalability but even more importantly to specifically keep out of any unnecessary quagmire of trying to reconcile not just the bits of data but the structures.
And my own scouting report, from having done a lot of data-reconciliation in a system designed for it, and in other lives of a lot of more-painful data-reconciliation in various systems not really designed for it, is the same as Stefano's report from his experiences at Freebase: getting from data to information, at scale, is hard. It's mainly hard in ways that machines cannot just solve for you, and anybody who thinks RDF is a solution to data-reconciliation is, as Stefano puts it, confusing model mixability with semantic reconciliation, and has probably not noticed because they've only been playing with toy datasets, which is like thinking you're learning Architecture by building a few birdhouses.
And this is all exactly why I have repeatedly argued for treating the "Semantic Web" technology problem as a database-tech problem first, and a web-tech problem only secondarily. David complains, justifiably, about the pain of combining two small, simple folk-dance datasets he himself cares about. But just as Stefano says in another example, the syntax problems here (e.g. text in a YouTube description box, rather than formally-labeled data records) are fairly trivial compared to the modeling differences. All the URIs and XSL transformations in the world are not going to allow every two datasets to magically operate as one. Some person, without or without tools to magnify their effort, is going to have to rephrase one dataset in the argot of the other.
And to say another thing I've said before again, the fact that RDF doesn't itself solve these problems is not its terminal design flaw. Its terrible flaw is that it isn't even a sufficient foundation upon which to build the tools that would help solve these problems. It takes the right fundamental idea, that the most basic conceptual structure of data is things and the relationships between them, but then becomes so invertedly obsessed with the theoretical purity of this reduction that it leaves a whole layer of needed practical elaboration unbuilt. We shouldn't need abstruse design patterns to get ordered lists, or rickety reasoning engines just to get relationships that go both directions, or endless syntax officiousness that gets us expensive precision with no gain in accuracy.
This effort has let us down. And worse, it has constrained smart people so that their earnest efforts cannot help but let us down. After years and years of waiting, we have no Semantic Web of information, we have only Linked Data, where the word "Linked" is so tenuously justified it might as well be replaced with some pink-ink-drinking Seuss pet, and the word "Data" is tragically all too accurate. We have all these computers, living abbreviated lives of quiet depreciation, filled with the data that should become our wisdom, and yearning, if they are allowed to yearn for even one thing, to be able to tell us what they know.
But I think this is less of a debate than Stefano and (particularly) David are making it seem. Or, rather, that the real polarization is along a slightly different axis. The biggest difference between Stefano's world and David's happens before you even start to worry about when you're going to start worrying about data cleanup: Freebase is attempting to build a single huge structured-data-representation of (eventually) all knowledge; David's work is largely about building lots of little structured-data-representations of individual (relatively) small blobs of knowledge.
Within a dataset, though, Stefano and David (and I) are in enthusiastic agreement: internal consistency is what makes data meaningful. If your data doesn't know whether "Beatles" and "The Beatles" are one band or two different ones, your answers to any question that touches either one are liable to be so pathetically wrong that people will very quickly stop asking you things. In the classic progression from Data to Information to Knowledge to Wisdom, this is what that first step means: Data is isolated, Information is consolidated. It would be inane to be against this. (Unless, I guess, you were also against Knowledge and Wisdom.)
It's on the definition of "within", though, that most of the interesting issues under the heading of "the Semantic Web" wobble. David says "I argued in favor of letting individuals make their own [datasets] ... and worry about merging them with other peoples data later." He really does mean individuals, and the MIT-developed tool he has in mind (Exhibit) is designed (and only suited) for fairly small datasets. Freebase is officially a database of everything, but of course within this everything are lots of subsets, and Stefano is really talking more about cleaning up these subsets than the whole universe. Reconciling all the artist references to "Beatles" and "The Beatles" from albums and songs is a straightforward practical task both beneficial and probably necessary for asking questions of a song/album/artist dataset, whether it's embedded in a larger one or not. Reconciling the "'The Beatles" who are the artist of the song "Day Tripper" with "The Beatles" who are depicted in cartoon form on a particular vintage lunchbox for sale on eBay, on the other hand, is less conceptually straightforward, and of more obscure practical import.
The thing I'm working on at ITA falls in between Exhibit and Freebase on this dataset axis, both in capacity and design. We handle datasets much bigger than you could put in Exhibit, and allow you to explore and analyze them in ways that Exhibit cannot; but we separate datasets, partly for scalability but even more importantly to specifically keep out of any unnecessary quagmire of trying to reconcile not just the bits of data but the structures.
And my own scouting report, from having done a lot of data-reconciliation in a system designed for it, and in other lives of a lot of more-painful data-reconciliation in various systems not really designed for it, is the same as Stefano's report from his experiences at Freebase: getting from data to information, at scale, is hard. It's mainly hard in ways that machines cannot just solve for you, and anybody who thinks RDF is a solution to data-reconciliation is, as Stefano puts it, confusing model mixability with semantic reconciliation, and has probably not noticed because they've only been playing with toy datasets, which is like thinking you're learning Architecture by building a few birdhouses.
And this is all exactly why I have repeatedly argued for treating the "Semantic Web" technology problem as a database-tech problem first, and a web-tech problem only secondarily. David complains, justifiably, about the pain of combining two small, simple folk-dance datasets he himself cares about. But just as Stefano says in another example, the syntax problems here (e.g. text in a YouTube description box, rather than formally-labeled data records) are fairly trivial compared to the modeling differences. All the URIs and XSL transformations in the world are not going to allow every two datasets to magically operate as one. Some person, without or without tools to magnify their effort, is going to have to rephrase one dataset in the argot of the other.
And to say another thing I've said before again, the fact that RDF doesn't itself solve these problems is not its terminal design flaw. Its terrible flaw is that it isn't even a sufficient foundation upon which to build the tools that would help solve these problems. It takes the right fundamental idea, that the most basic conceptual structure of data is things and the relationships between them, but then becomes so invertedly obsessed with the theoretical purity of this reduction that it leaves a whole layer of needed practical elaboration unbuilt. We shouldn't need abstruse design patterns to get ordered lists, or rickety reasoning engines just to get relationships that go both directions, or endless syntax officiousness that gets us expensive precision with no gain in accuracy.
This effort has let us down. And worse, it has constrained smart people so that their earnest efforts cannot help but let us down. After years and years of waiting, we have no Semantic Web of information, we have only Linked Data, where the word "Linked" is so tenuously justified it might as well be replaced with some pink-ink-drinking Seuss pet, and the word "Data" is tragically all too accurate. We have all these computers, living abbreviated lives of quiet depreciation, filled with the data that should become our wisdom, and yearning, if they are allowed to yearn for even one thing, to be able to tell us what they know.