Thursday, May 13, 2010

A Long Handle on Shortened Digital Object Identifiers

Google has launched a social networking site called Orkut. Already the site has over 100 million members worldwide! What's that? You haven't heard of it? No, it's not new; it's actually more than 6 years old. And those 100 million users? They're mostly in Brazil and India.

You might be asking yourself, "What is Google doing running a social networking site for Brazil and India?", but a better question might be "What do you do in life when you take a home run swing and manage a nub single?"

The technology behind the Digital Object Identifier has proven to be one of these infield hits. It's called the "Handle" system, and it was developed fifteen years ago by computer scientists who feared that the DNS system that maps host names to IP addresses on the internet would fail to scale for the very large numbers of digital objects needed in future digital library applications. Bob Kahn, who with Vint Cerf designed the Transmission Control Protocol (TCP) underlying today's internet, was the driving force behind the Handle system. It implements a highly scalable distributed naming infrastructure for digital objects. Kahn's organization, CNRI, then developed the Digital Object Identifier to serve as a core application of the Handle System.

The Digital Object Identifier, or DOI, was designed and promoted as a solution to two problems: broken URLs and rights management. The fact that fifteen years later broken URLs are still a problem and digital rights are a quagmire suggests that the DOI has had limited impact in these areas. Did something go wrong?

Fifteen years ago, Google did not exist. Netscape was just a year old. The developers of the Handle system hoped that Handle resolution would get built into web browser software alongside DNS. That never really happened, perhaps because Netscape and Microsoft had development roadmaps for web servers and browsers that diverged from CNRI's vision. To allow Handles to work in unmodified browser software, CNRI was forced to implement a proxy service that connected the system of DNS resolution to the system of handle resolution via a web server. This proxy, at http://dx.doi.org/, serves almost all of the world's DOIs. A link pointed at the DOI proxy gets redirected to a web address designated by the owner of the digital object or perhaps to a library with DOI-enabled software. This redirection capability gives publishers the flexibility to move digital objects from one address to the other without breaking links, even if the object is traded to another publisher using a completely different website.

Things have changed in 15 years. The World Wide Web turned out to be not so interested in digital objects with complex management systems and rights regimes. Instead, the World Wide Web turned out to be an unthinkably large number of web pages with negligible rights management, indexed by search engines. Persistence of links turned out to be less important than the findability of content in search engines.

Since search engines never bothered to learn about the Handle system, the DOI proxy turned out to be much more important than the Handle resolution system that hides behind it. Details that were inconsequential 15 years ago have become important. One of these details is the DOI proxy's HTTP status code. This code tells a requestor the meaning of the redirect URL. As I've previously written, the DOI proxy returns a 302 status code. The Google index interprets this to mean that it should assign indexing weight to the URL beginning with "http://dx.doi.org/", and not the publisher-specified URL. 302 is the correct code for the DOI proxy, because if the publisher-specified URL changes, the publisher doesn't want to lose all the "link juice" it has earned by being linked to from other sites.
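
The difference between the two redirect codes can be put in a few lines of Python. This is a deliberately simplified model of the behavior described above, not a description of how any real search engine works; the function name credited_url and the publisher address are made up for illustration.

```python
def credited_url(status: int, requested: str, location: str) -> str:
    """Return the URL that a simplified indexer would credit for a link.

    Assumption: a 301 (moved permanently) passes credit on to the target,
    while a 302 (temporary redirect) leaves credit with the URL that was
    originally requested.
    """
    if status == 301:
        return location
    if status == 302:
        return requested
    raise ValueError(f"status not handled by this sketch: {status}")


doi_link = "http://dx.doi.org/10.1103/PhysRevLett.48.1559"
publisher_page = "http://publisher.example/abstract/1559"  # hypothetical target

print(credited_url(302, doi_link, publisher_page))  # credit stays with the DOI link
print(credited_url(301, doi_link, publisher_page))  # credit moves to the publisher page
```

With a 302, the credit survives any later change to the publisher's address, which is the point of the paragraph above.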

The DOI has not been generally adopted by the web at large, with one important exception: CrossRef. CrossRef added some useful machinery onto the DOI and turned it into an absolutely essential service for publishers of ejournals and other types of content that need to embed persistent links. CrossRef accounts for 96% of all registered DOIs and DOI resolutions (about 60 million per month).

60 million resolutions per month might seem like a lot of traffic, but it's not so big on the scale of today's web. Google delivers that many searches in 3 hours. A more direct comparison would be bit.ly, the URL shortening service, which reported 3.4 BILLION resolutions in March, or 4.6 million resolutions per hour.
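
For anyone who wants to check the arithmetic, here is a quick sketch using only the figures quoted above. Treating the DOI's month as a 31-day month, like March, is my assumption for the sake of a like-for-like comparison.

```python
HOURS_IN_MARCH = 31 * 24  # 744 hours

bitly_per_month = 3.4e9  # bit.ly resolutions reported for March
doi_per_month = 60e6     # approximate DOI resolutions per month

bitly_per_hour = bitly_per_month / HOURS_IN_MARCH
doi_per_hour = doi_per_month / HOURS_IN_MARCH
ratio = bitly_per_month / doi_per_month

print(f"bit.ly: {bitly_per_hour / 1e6:.1f} million per hour")
print(f"DOI:    {doi_per_hour / 1e3:.0f} thousand per hour")
print(f"ratio:  {ratio:.0f}x")
```

On these numbers, bit.ly handles more than 50 times the DOI traffic.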

The staggering popularity of URL shortening services such as bit.ly prompted CrossRef's Geoffrey Bilder to propose last year a similar service for DOI links. Evidently, the folks at the International DOI Foundation agreed that this was a good idea, because last week, they launched the "shortDOI" service.

ShortDOI is meant to address a shortcoming of DOIs: their length and ugliness. When DOI started, no one could have imagined that URLs would appear prominently on boxes of children's cereal, as they do today. It was assumed that they would be hidden in links and be used exclusively by machines. The original spec for the DOI string even allowed DOIs to include spaces and non-printing Unicode characters! Digital object owners were free to choose ID strings that were long and full of punctuation, even punctuation that was incompatible with web pages. ShortDOI uses a small number of alphanumeric characters to do away with all the DOI ugliness. It also does away with the publisher prefix, which hasn't been much use anyway. So instead of 10.1103/PhysRevLett.48.1559 or 10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2, shortDOI lets us use URLs like http://doi.org/aa9 and http://doi.org/aabbe.
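
To see why the long form is awkward in web pages, here is a minimal Python sketch that turns the two DOIs above into links through the http://dx.doi.org/ proxy. The helper name doi_to_url is mine, and the choice to leave only "/" unescaped is an assumption; the proxy's own escaping rules may differ.

```python
from urllib.parse import quote, unquote

PROXY = "http://dx.doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a proxy link, percent-encoding characters that are unsafe in a URL."""
    # Keep "/" readable; anything else outside letters, digits and "_.-~" is escaped.
    return PROXY + quote(doi, safe="/")


simple = "10.1103/PhysRevLett.48.1559"
ugly = "10.1002/(SICI)1097-0258(19980815/30)17:15/16<1661::AID-SIM968>3.0.CO;2-2"

print(doi_to_url(simple))
print(doi_to_url(ugly))

# Decoding the escaped part recovers the original DOI string.
assert unquote(doi_to_url(ugly)[len(PROXY):]) == ugly
```

The first DOI passes through untouched, while the second picks up a percent escape for every parenthesis, colon, angle bracket and semicolon.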

ShortDOI can't quite decide whether it's a DOI or a shortener. Like DOI, it's implemented using the Handle system. Instead of redirecting through the DOI proxy, shortDOI acts as an alternate proxy, and uses the same 302 redirects that the long DOI proxy uses. From the perspective of search engines, a shortDOI is a new object to be ranked separately from the DOI. The link juice earned by a short DOI won't accrue to the DOI it has shortened.

Like a shortener, shortDOI assigns codes sequentially, making it easy for robots to harvest content identified by shortDOIs. ShortDOI allows anyone to create the shortened URL, but provides none of the tracking, statistics, spam protection and malware rejection offered by other shortener services. Library OpenURL servers don't yet work with shortDOI, even though the shortDOI proxy will try to send shortDOI handles to them.
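
Sequential assignment is what makes harvesting easy: a robot only needs a counter. The sketch below illustrates the idea with a made-up base-36 alphabet. The alphabet, the numbering and the function names are all hypothetical, and are not shortDOI's actual scheme.

```python
from itertools import islice
from typing import Iterator

# Assumed alphabet for illustration only; not shortDOI's real one.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def encode(n: int, alphabet: str = ALPHABET) -> str:
    """Write a non-negative integer in base len(alphabet)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    digits = []
    while True:
        n, r = divmod(n, len(alphabet))
        digits.append(alphabet[r])
        if n == 0:
            break
    return "".join(reversed(digits))


def sequential_codes(start: int = 0) -> Iterator[str]:
    """Yield codes in the order a sequential allocator would hand them out."""
    n = start
    while True:
        yield encode(n)
        n += 1


print(list(islice(sequential_codes(start=34), 4)))  # ['y', 'z', '10', '11']
```

Anyone who knows one valid code can step through its neighbors; a shortener that handed out random codes would not allow that.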

The implementation choices made for shortDOI aren't mistakes; they make perfect sense in the context of the Handle naming architecture. Nonetheless, the difficulties they present for libraries and search engine optimization highlight the Handle system's misfit with the dominant link resolution architecture of the web.

The Handle system has been very well designed and managed. I've written software that uses the Handle software libraries and I found them to be elegant and easy to work with. The principles and algorithms built into the Handle system are very similar to those used years later inside Google's internal file system or by any number of other large digital object management systems.

The Handle System is relatively inexpensive, but its costs are now higher than those of the large-scale URL shorteners. According to public tax returns, the DOI Foundation pays CNRI about $500,000 per year to run the DOI resolution system. At roughly 60 million resolutions per month, that works out to about 70 cents per thousand resolutions. Compare this to bit.ly, which has attracted $3.5 million of investment and has resolved about 20 billion shortened links, for a cost of about 18 cents per thousand. It remains to be seen whether bit.ly will find a sustainable business model; competing directly with DOI is not an impossibility.

What do you do with infrastructure that has been successful in Brazil or scholarly publishing but not elsewhere? Do you keep it alive in hopes that after twenty years, some unforeseen circumstance will result in its overnight universal adoption? Do you scale back, phase out or sell out in favor of more cost effective alternatives? Or do you just do your best to continue serving loyal users? I don't know the answer, but I do know that in baseball and cricket you've got to run the bases to score.

Update: Corrected CrossRef share of DOI resolutions.
