Thursday, October 29, 2009

The New York Times Blunders Into Linked Data, Pillages Freebase and DBPedia

Notwithstanding Larry Lessig, when you try to use the precision of code to express the squishiness of the legal system, you are bound to run into problems, as I've explored in my posts on copyright.

This Thursday, the New York Times took advantage of the International Semantic Web Conference to make good on their previous promise to begin releasing the New York Times subject index as Linked Data. No matter how you look at it, this is a big advance for the semantic web and the Linked Data movement. It's also a potential legal disaster for the New York Times.

To understand what the New York Times did wrong, you have to understand a little bit about the workings of RDF, the data model underlying the semantic web. In particular, you have to understand entailment. Entailments are the sets of facts that can be deduced from the meaning of semantic web data. The crucial difference between plain-old data and Linked Data is that Linked Data includes these entailments.

Consider the English-language statement "apples are red". Because it is expressed in a language, it has meaning in addition to the single fact that apples are red. If we also assert that a specific object is an apple, then there is an entailment that the object is also red.
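To make the apple example concrete, here is a minimal Python sketch of entailment. It is only an illustration, not how a real RDF reasoner works: the `entail` function, the `is_a` and `has_color` predicates, and the single deduction rule are all invented for this example.

```python
# Toy entailment over (subject, predicate, object) facts.
# Every name here is hypothetical and exists only for this illustration.

def entail(facts):
    """Return the facts plus everything deducible from one rule:
    if X is_a C and C has_color K, then X has_color K."""
    inferred = set(facts)
    changed = True
    while changed:  # repeat until no new facts appear
        changed = False
        for (x, p1, c) in list(inferred):
            if p1 != "is_a":
                continue
            for (c2, p2, k) in list(inferred):
                if p2 == "has_color" and c2 == c:
                    new_fact = (x, "has_color", k)
                    if new_fact not in inferred:
                        inferred.add(new_fact)
                        changed = True
    return inferred


facts = {
    ("apple", "has_color", "red"),     # "apples are red"
    ("this_object", "is_a", "apple"),  # "this object is an apple"
}
result = entail(facts)
print(("this_object", "has_color", "red") in result)  # prints True
```

Nobody ever stated that the object is red; the fact appears only because the two asserted facts, taken together, entail it.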

The New York Times Linked Data is expressed in the RDF language and uses vocabularies called OWL, SKOS, Dublin Core, and Creative Commons (denoted here by the prefixes "owl:", "skos:", "dc:" or "dcterms:", and "cc:"). You can download it yourself at http://data.nytimes.com/people.rdf (11.9 MB download).

Here's a simplified bit of the New York Times Linked Data. It defines a concept about C. C. Sabathia, a baseball pitcher who lost a game on Wednesday for the New York Yankees:
<rdf:Description rdf:about="http://data.nytimes.com/N24334380828843769853">
<skos:prefLabel>Sabathia, C C</skos:prefLabel>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/CC_Sabathia"/>
<owl:sameAs rdf:resource="http://rdf.freebase.com/rdf/en.c_c_sabathia"/>

<dc:creator>The New York Times Company</dc:creator>
<cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
<dcterms:rightsHolder>The New York Times Company</dcterms:rightsHolder>
<cc:attributionName>The New York Times Company</cc:attributionName>
</rdf:Description>
The first thing this does is create an identifier, "http://data.nytimes.com/N24334380828843769853", for the "C. C. Sabathia" subject concept. The New York Times uses this set of subjects to create topic pages, and the main purpose of releasing this data set is to help people link concepts throughout the internet to the appropriate New York Times topic pages.

Next, it gives a label for this concept, "Sabathia, C C". So far so good. The next two statements say that the New York Times topic labeled "Sabathia, C C" is the same concept previously identified by DBPedia, a Linked Data version of Wikipedia, and by Freebase, another large collection of Linked Data. This is even better, because it tells us that we can use information from Wikipedia and Freebase to help us infer facts about the New York Times C. C. Sabathia topic. The "sameAs" term is defined as part of the "OWL" standard vocabulary, which defines how machines should process these assertions of sameness.

The last four lines (dc:creator through cc:attributionName), highlighted in red, assert that the C. C. Sabathia concept was created by "The New York Times Company", which is the rights holder for the C. C. Sabathia concept, and that if you want to use the C. C. Sabathia concept, The New York Times Company will license the concept to you under the terms of a particular Creative Commons License.

There are two separate blunders made by the stuff in red. The first blunder is that the New York Times is attempting to say that the C. C. Sabathia concept is a work "PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW." This is complete rubbish. The information provided by the New York Times about the C. C. Sabathia concept consists of a few facts that cannot be protected by copyright or any other law that I know of. (The entire 5,000-entity collection, however, is probably protectable in countries other than the US.)

The second blunder is much worse. Where the first blunder is merely silly, the second blunder is akin to attempted property theft. Because the New York Times has asserted that it holds the rights to the C. C. Sabathia topic, and further, that the C. C. Sabathia topic is the same as the Freebase "c_c_sabathia" topic and the Wikipedia "CC_Sabathia" topic, by entailment, the New York Times is asserting that it is the rights holder for those concepts as well.

You might argue that this is a harmless error. But in fact, there is real harm. Computers aren't sophisticated enough to deal with squishy legal concepts. If you load the New York Times file into an OWL-aware data store, the resulting collection will report that the New York Times Company is the rights holder for 4,770 concepts defined by Wikipedia and 4,785 concepts defined by Freebase.
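The following Python sketch shows, in miniature, how that happens. It is a naive stand-in for an OWL-aware store, not a real reasoner: the `sameas_closure` function is hypothetical, prefixed names such as "owl:sameAs" are treated as plain strings, and the only rule applied is that every statement about a subject is copied to anything declared the same as it. The three identifiers are the ones from the Sabathia record above.

```python
# Naive owl:sameAs handling over (subject, predicate, object) triples.
# A simplification for illustration; real OWL reasoners do much more.

SAME_AS = "owl:sameAs"


def sameas_closure(triples):
    """Copy every statement about a subject to all identifiers declared
    owl:sameAs it (treating sameAs as symmetric and transitive)."""
    parent = {}  # union-find forest over identifiers

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Merge the two sides of every sameAs statement into one group.
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)

    # Re-state each non-sameAs triple for every alias of its subject.
    result = set(triples)
    for s, p, o in triples:
        if p == SAME_AS or s not in parent:
            continue
        for alias in groups[find(s)]:
            result.add((alias, p, o))
    return result


NYT = "http://data.nytimes.com/N24334380828843769853"
DBPEDIA = "http://dbpedia.org/resource/CC_Sabathia"
FREEBASE = "http://rdf.freebase.com/rdf/en.c_c_sabathia"

triples = {
    (NYT, "skos:prefLabel", "Sabathia, C C"),
    (NYT, SAME_AS, DBPEDIA),
    (NYT, SAME_AS, FREEBASE),
    (NYT, "dcterms:rightsHolder", "The New York Times Company"),
}

inferred = sameas_closure(triples)
for s, p, o in sorted(inferred):
    if p == "dcterms:rightsHolder":
        print(s, "->", o)
```

Under these assumptions the loop prints a rights-holder line for all three identifiers, including the DBPedia and Freebase ones, even though the input made that claim only about the New York Times identifier.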

Now before you start bashing the New York Times, it's important to acknowledge that RDF and Linked Data don't make it particularly easy to attach licenses or attributions to semantic web data. The correct ways to do this are all ugly and not standardized. You would think that this would be a requirement for commercial viability of the semantic web.

People trying to use New York Times Linked Data can deal with this in three ways. They can decide not to use data from the New York Times, they can ignore all licensing and attribution assertions that the Times makes, or they can hope that the problem goes away soon.

A fourth way would be to sue the New York Times Company for damages. At long last there's a lucrative business model for Linked Open Data.

Update: I have two follow-up posts: "The Blank Node Bother and the RDF CopyMess" and "The New York Times Gets It Right; Does Linked Data Need a Crossref or an InfoChimp?"