Friday, October 23, 2009

Copyless Crowdscanning: How to Legally Index the World's Books

Here's how I know that I have engineering in my DNA. Whenever I hear something labeled as impossible, impractical or unlawful, I can't restrain myself from trying to think of ways around the physical, logistical and legal constraints that supposedly imply impossibility. "That", "is" and "impossible" are fighting words to an engineer. And that's why I've admired the proposed Google Books Settlement. By way of a spectacular feat of legal engineering, it has suggested a way to do the seemingly impossible (to build a database of all the world's books) in the face of the tremendous obstacle posed by an extremely messy legal situation.

But despite my admiration for the "engineering" involved in the settlement, there have always been some things I didn't like about it. And despite all that's been written about it, and the many aspects that people have objected to, I've never seen anyone voice my particular misgivings, perhaps because they come from a peculiarly engineering point of view.
  1. The settlement uses a legal innovation to accomplish its goals. I don't like that (the "legal" part, not the "innovation" part). Many people have objected to the particular innovation that is used, arguing that this precedent could lead to a reign of tyranny and/or other cataclysm, but I've not seen any objection to the use of legal apparatus in the first place. I've often made the disclaimer here that I Am Not A Lawyer, but I've generally downplayed my ingrained bias for using technology rather than law to solve the world's problems.
  2. The settlement seems to be based on a presumption that Google's database of all the world's books cannot be built without making copies. I don't like to assume things are impossible. I should also note that several of the arguments opposing the Google Books Settlement rely on exactly the same presumption!
As the months have dragged on and the postponements have piled up, my first objection has started to make more and more sense to me. After mulling it over for more than six months, I'm starting to think that my second objection is also valid. The rest of this post describes how it might be possible to build a full-text database of all the world's books without doing any copyright-infringing copying. I'll call this scheme "Copyless Crowdscanning".

What got me started on this line of thought were some simple cost calculations I presented in my article on Dan Reetz' DIY book scanner. It made me realize that the idea of having hundreds of thousands of people scanning their books with cheap scanners was not out of the realm of possibility. The main barrier to assembling a database of all the world's books will no longer be the scanning, but rather the laws governing copyright. So my focus is on how to do crowdscanning so that copyrights are not infringed; the easiest way to do that is to not make any copies.

Here are the assumptions I start with. As I've been learning about copyright, I've learned that there will always be a copyright lawyer somewhere willing to contest any common-sense assumption about copyright, so it's important to start somewhere. First, I'm assuming that scanning a small number of pages of a book (suppose that number is 1% of the book) for the purpose of indexing those pages is not a violation of copyright, as long as I never redistribute the scans and I destroy them once I finish my indexing. The indices are things I should be able to keep and redistribute.

Second, I'm assuming that it is not a violation of copyright to redistribute single sentences from a book. So, for example, publishing the following sentence:
The punishment lay in knowing that you were putting all of that effort into letting a kind of intellectual poison infiltrate your brain down to its very roots.
is not a violation of Neal Stephenson's copyright to the book Anathem. A corollary of that is that if I shuffle the order of all the sentences in a book, I can redistribute that jumble without violating copyright.

Finally, I'm assuming that scanning and distributing the title page of a book and its verso cannot be a violation of copyright. Distributing these pages would be necessary in many cases just to convey statements of fact, and statements of fact are not subject to copyright. I recognize that artwork on these pages may need excision.

Let's suppose that we had a large number of people participating in our database building project. Suppose for example, that 100,000 people participated. Each person would scan a small fraction of each book they owned, along with its title pages. The title pages would be submitted to a book identity server, which would return a book identifier. The rest of the page scans would be processed by software, and the scans would then be destroyed. The software would digitize the scans, then chop the pages into individual sentences. An index of the pages would be generated and submitted to an "index aggregation" service. The sentences would be shuffled and submitted to a "sentence serving" service.
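The participant-side processing described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the naive sentence splitter, and the shape of the index (word mapped to book and page) are all hypothetical choices of mine, and the OCR step is assumed to have already turned each sampled page into text.

```python
import random
import re
from collections import defaultdict


def split_sentences(page_text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", page_text.strip())
    return [p for p in parts if p]


def process_partial_scan(book_id, pages, seed=None):
    """Turn a participant's sampled pages into a partial index and a sentence jumble.

    pages: dict mapping page number -> OCR text of that page (hypothetical input).
    Returns (partial_index, shuffled_sentences). The caller is expected to
    destroy the page scans and the page text once this returns.
    """
    index = defaultdict(set)  # word -> set of (book_id, page_number)
    sentences = []
    for page_number, text in pages.items():
        for sentence in split_sentences(text):
            sentences.append(sentence)
            for word in re.findall(r"[a-z0-9']+", sentence.lower()):
                index[word].add((book_id, page_number))
    # Shuffle so the submitted sentences carry no page or ordering information.
    random.Random(seed).shuffle(sentences)
    return dict(index), sentences


# Example: two sampled pages from a book the identity server called "book-1".
partial_index, jumble = process_partial_scan(
    "book-1",
    {12: "The cat sat. The dog ran!", 13: "Cats sleep."},
)
```

The partial index would go to the index aggregator and the jumble to the sentence server; in this sketch the sentences deliberately carry no page numbers, so the two submissions cannot be lined up page by page.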

After many people have made partial scans and submitted partial indices to the index aggregator, a complete index would emerge that can be used just as Google Book Search is used. The complete sentences would be provided by the sentence server to provide the context of the result sets.
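To make the aggregation step concrete, here is a minimal sketch of the two services. The class and method names are hypothetical, and the coverage estimate rests on an extra assumption that is not part of the scheme itself: each participant picks their 1% of pages uniformly at random and independently of everyone else.

```python
import re
from collections import defaultdict


class IndexAggregator:
    """Merges partial indices (word -> set of (book_id, page)) from many participants."""

    def __init__(self):
        self.postings = defaultdict(set)

    def submit(self, partial_index):
        for word, locations in partial_index.items():
            self.postings[word] |= set(locations)

    def search(self, word):
        """Return the sorted (book_id, page) locations where the word was seen."""
        return sorted(self.postings.get(word.lower(), set()))


class SentenceServer:
    """Holds unordered sentences per book; it never sees pages or sentence order."""

    def __init__(self):
        self.sentences = defaultdict(set)  # book_id -> set of sentences

    def submit(self, book_id, shuffled_sentences):
        self.sentences[book_id].update(shuffled_sentences)

    def context(self, book_id, word):
        """Return the sentences from one book that contain the word."""
        target = word.lower()
        return sorted(
            s for s in self.sentences.get(book_id, ())
            if target in re.findall(r"[a-z0-9']+", s.lower())
        )


def expected_coverage(participants, fraction=0.01):
    """Expected share of a book's pages indexed, assuming independent random sampling."""
    return 1 - (1 - fraction) ** participants
```

Under that sampling assumption, about 63% of a book's pages would be covered after 100 people have each scanned 1% of it, and over 99% after 500. If participants tend to scan the same pages (the first chapter, say), coverage would grow more slowly, so some coordination about which pages to scan would probably be needed.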

Note that neither the index aggregator nor the sentence server would be able to reconstitute a book or even the pages from a book. It seems to me that it should be possible to add some encrypted information and send the keys to yet another party so as to allow reconstitution of the pages in authorized circumstances, such as for use by people with disabilities. If you can't use the information to reconstitute the book, then it seems to me that no copy exists and no copyrights have been infringed.

If my assumptions are incorrect, then I should expect that HarperCollins will soon be suing me for copyright infringement. I'll be sure to let you know. If they are correct, but there's some theory that would expose any of the crowdscanning participants to liability, then perhaps someone who Really-Is-A-Lawyer could elaborate in the comments. I recognize that copyless crowdscanning wouldn't be applicable without modification to things like art books, artwork in books, poetry collections, sheet music, periodicals and reference works, but it would be a start. And it would make some engineers happy.

Update: Several people (including real lawyers) have commented to me that crowdscanning would not help much as an infringement defense if the result of the entire system had the effect of making the entire text available. I just want to emphasize that I think a system can be engineered so as to enable indexing while preventing text reconstruction and avoiding the use of copies.