Wednesday, July 8, 2009

URL Shorteners and the Semantics of Redirection

When I worked at Bell Labs in Murray Hill, NJ, it amused me that at one end of the building, the fiber communications people were worrying that no one could ever possibly make use of all the bandwidth they could provide: we would never be able to charge for telephone calls unless they figured out how to limit the bandwidth. At the other end of the building, computer scientists were figuring out how to compress information so that they could pack more and more into tiny bit-pipes. I'm still not sure who won that battle.

When I was part of a committee working on the OpenURL standard, we had a brief discussion about the maximum length URL that would work over the internet. A few years before that, there were some systems on the internet that barfed if a URL was longer than 512 characters, but most everything worked up to 2,000 characters, and we anticipated that that limit would soon go away. So here we are in 2009, and Internet Explorer is just about the only thing that still has a length limit as low as 2,083 characters. Along comes Twitter, with a 140-character limit on an entire message, and all of a sudden, the URLs we've been making have become TOO LONG! Just as fast, URL shortening services sprang up to make the problem go away.

The discussion on my last post (on CrossRef and OpenURL) got me interested in the semantics of redirection, and that got me thinking about the shortening services, which have become monster redirection engines. When we say something about a URI that is resolved by a redirector, what, exactly, are we talking about?

First, some basics. A redirection occurs when you click on a link and the web server for that link tells your browser to go to another URL. Usually, the redirection occurs in the HTTP protocol that governs how your web browser gets web pages. Sometimes, a redirect is caused by a directive in an HTML page, or programmed by JavaScript in that page. The result may seem the same but the mechanism is rather different, and I won't get into it any further.

There are actually 3 types of redirects provided for in the HTTP protocol, known by their status codes as "301", "302" and "303" redirects. There are 5 other redirect status codes that you can safely ignore if you're not a server developer. The 301 redirect is called "Moved Permanently", the 302 is called "Found" and the 303 is called "See Other". Originally, the main reason for the different codes was to help network servers figure out whether to cache the responses to save bandwidth (the fiber guys had not deployed so much back then and the bit squeezers were top dogs). Nowadays the most important uses of the different codes are in search engines. Google will interpret a 301 as "don't index this URL, index the redirect URL". A 302 will be interpreted as "index the content at the redirect URL, but use this URL for access". According to a great article on URL shorteners by Danny Sullivan, Google will treat a 303 like a 302, but who knows?
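To make the mechanism concrete, here is a minimal Python sketch of what a redirect looks like on the wire: a status line carrying the code, plus a Location header naming the target. The raw response text, the example.com target and the parse_redirect helper are all made up for illustration; a real client would read this from a socket rather than from a string.

```python
from http import HTTPStatus

# Hypothetical response a shortener might send back (illustrative only).
RAW_RESPONSE = (
    "HTTP/1.1 301 Moved Permanently\r\n"
    "Location: http://example.com/a/much/longer/address\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

# The three redirect codes discussed in the text.
REDIRECT_CODES = (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND, HTTPStatus.SEE_OTHER)


def parse_redirect(raw):
    """Return (code, standard name, redirect target or None) for a raw response."""
    head = raw.split("\r\n\r\n", 1)[0]           # drop any body
    lines = head.split("\r\n")
    _version, code, _reason = lines[0].split(" ", 2)

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    status = HTTPStatus(int(code))
    # Only report a target when the code is one of the three redirects.
    target = headers.get("location") if status in REDIRECT_CODES else None
    return status.value, status.phrase, target


print(parse_redirect(RAW_RESPONSE))
```

The names come from Python's own table of status codes, so swapping 301 for 302 or 303 in the sample text yields "Found" or "See Other" respectively.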

Just as 301 and 302 semantics have been determined by their uses in search engines, the 303 has been coopted by the standards-setters of the semantic web, and they may well be successful in determining the semantics of the 303. As described in a W3C Technical Recommendation, the 303 is to be used
... to give an indication that the requested resource is not a regular Web document. Web architecture tells you that for a thing resource (URI) it is inappropriate to return a 200 because there is, in fact, no suitable representation for those resources.
In other words, the 303 is supposed to indicate that the Thing identified by the URI (URL) is something whose existence is NOT on the web. Tim Berners-Lee wrote a lengthy note about this that I found quite enjoyable, though at the end I had no idea what it was advocating. The discussion that led to the W3C Recommendation has apparently been extremely controversial, and has been given the odd designation "httpRange-14". The whole thing reminds me of reading the existentialists Sartre and Camus in high school - they sounded so much more understandable in French!

As discussed in Danny Sullivan's article, most of the URL shorteners use 301 redirects, which is usually what most users want to happen. An indexing agent or a semantic web agent should just look through these redirectors and use the target resource URL in its index. The DOI "gateway" redirector at dx.doi.org discussed in my previous post uses a 302 redirect. Unless DOIs are handled specially by a search engine, it means that the "link credit" (a.k.a. Google juice) for a dx.doi.org link will accrue to the dx.doi.org URL rather than the target URL. This seems appropriate. Although I indicated that if you use Linked Data rules the dx.doi.org link identifies whatever is indicated by the returned web page, from the point of view of search engines, that URI identifies an abstraction of the resource it redirects to. A redirection service similar in conception, PURL, also uses 302 redirects.

I was curious about the length limits of the popular URL shorteners. Using a link to this blog, padded by characters ignored by Blogger.com, I shortened a bunch of long URLs. Here are 4 shortened 256-character links to this blog:
They all work just fine. Moving to 1,135-character links, everything still works (at least in my environment):
At 2,083 characters, the limit for Internet Explorer, we start separating the redirection studs from the muffins.
When I add another character, to make 2,084 total, bit.ly and snurl.com both work, but blogger.com reports an error!
The compression ratios for these last two links are 109 to 1 for bit.ly and 95 to 1 for snurl. The bit squeezers would be happy.
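
The arithmetic behind the bit.ly figure can be sketched in a few lines of Python. One assumption is baked in: the short link is taken to be 19 characters long, the same length as the bit.ly link quoted further down in this post, since the links used in the test are not shown here. The compression_ratio helper is just an illustrative name.

```python
def compression_ratio(long_length, short_url):
    """Characters in the long URL per character in the short one."""
    return long_length / len(short_url)


# Assumption: a 19-character short link standing in for the one used in the test.
short_link = "http://bit.ly/aE0od"
ratio = compression_ratio(2084, short_link)

# Truncated to a whole number, this matches the bit.ly figure quoted above.
print(len(short_link), int(ratio))
```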

Next, I wanted to see if I could make a redirection loop. Most of the shortening services decline to shorten a shortened URL, but they're quite willing to shorten a URL from the PURL service. Also, I couldn't find any way to use the shortening services to fix a link that had rotted after I shortened it. It could be useful to add the PURL service as link-rot insurance behind a shortened URL if the 302 redirect is not an issue. So here's a PURL: http://purl.oclc.org/NET/backatcha that redirects to http://bit.ly/aE0od which redirects to http://purl.oclc.org/NET/backatcha etc. Don't click these expecting an endless loop; your browser should detect the loop pretty fast.
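
For the curious, here is a rough Python sketch of how a client could notice a loop like this one. It works on an in-memory table of redirects rather than the live services, and the follow_redirects function and its max_hops cutoff are illustrative assumptions, not a description of what any particular browser does; simply giving up after a fixed number of hops is another common approach.

```python
def follow_redirects(start, redirect_table, max_hops=20):
    """Walk a chain of redirects; return (urls visited, whether a loop was found)."""
    visited = []
    url = start
    while url in redirect_table:
        if url in visited:
            return visited, True      # we have been here before: a loop
        if len(visited) >= max_hops:
            break                     # give up on very long chains
        visited.append(url)
        url = redirect_table[url]
    visited.append(url)
    return visited, False


# The two links from the paragraph above, each pointing at the other.
table = {
    "http://purl.oclc.org/NET/backatcha": "http://bit.ly/aE0od",
    "http://bit.ly/aE0od": "http://purl.oclc.org/NET/backatcha",
}

chain, looped = follow_redirects("http://purl.oclc.org/NET/backatcha", table)
print(chain, looped)
```

Starting from the PURL, the walk visits both links once and then reports the loop as soon as it is sent back to the PURL.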

A recent article about how bit.ly is using its data stream to develop new services got me thinking again about how a shortening redirector might be useful in Linked Data. I've written several times that Linked Data lacks the strong attribution and provenance infrastructure needed for many potential applications. Could shortened URIs be used as Linked Data predicates to store and retrieve attribution and provenance information, along with the actual predicate? And will I need another HTTP status code to do it?

