Thursday, June 4, 2009

When are you collecting too much data?

Sometimes it can be useful to be ignorant. When I first started a company, more than 11 years ago, I decided that one thing the world needed was a database, or knowledgebase, of how to link to every e-journal in the world, and I set out to do just that. For a brief time, I had convinced an information industry veteran to join me in the new company. One day, as we were walking to a meeting in Manhattan, he turned to me and asked, "Eric, are you sure you understand how difficult it is to build and maintain a big database like that?" I thought to myself, how hard could it be? I figured there were about 10,000 e-journals total, and we were up to about 5000 already. I figured that 10,000 records was a tiny database- I could easily do 100,000 records even on my Powerbook 5400. I thought that a team of two or three software developers could do a much better job sucking up and cleaning up data than the so-called "database specialists" typically used by the information industry giants. So I told him, "It shouldn't be too hard." But really, I knew it would be hard; I just didn't know WHAT would be hard.

The widespread enthusiasm for Linked Data has reminded me of those initial forays into database building. Some important things have changed since then. Nowadays, a big database has at least 100 million records. Semantic Web software was in its infancy back then; my attempts to use RDF in my database 11 years ago quickly ran into hard bits in the programming, and I ended up abandoning RDF while stealing some of its most useful ideas. One thing that hasn't changed is something I was ignorant of 11 years ago- maintaining a big database is a big, difficult job. And as a recently exed ex-physicist, I should have known better.

The fundamental problem of maintaining a large knowledgebase is known in physics as the second law of thermodynamics, which states that the entropy of the universe always increases. An equivalent formulation is that perpetual motion machines are impossible. In terms that non-ex-physicist librarians and semantic websperts can understand, the second law of thermodynamics says that errors in databases accumulate unless you put a lot of work into them.

This past week, I decided to brush off my calculus and write down some formulas for knowledgebase error accumulation so that I wouldn't forget the lessons I've learned, and so I could make some neat graphs.

Let's imagine that we're making a knowledgebase to cover something with N possible entities. For example, suppose we're making a knowledgebase of books, and we know there are at most one billion possible books to cover. (Make that two billion- a new prefix is now being deployed!) Let's assume that we're collecting n records at random, so each record has a 1/N chance of covering any given entity. At some point, we'll start to get records that duplicate information we've already put into the database. How many records, n, will we need to collect to get 99% coverage? It turns out this is an easy calculus problem, one that even a brain that has spent 3 years as a middle manager can do. The answer is:
Coverage fraction, f = [1 - exp(-n/N)]
So to get 99% coverage of a billion possible entities, you'd need to acquire about 4.6 billion records. Of course there are some simplifications in this analysis, but the formula gives you a reasonable feel for the task.
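If you'd rather check the arithmetic with a computer than with calculus, here's a minimal Python sketch; the billion-entity figure and the 99% target are just the numbers from the example above, and the function name is mine.

import math

N = 1_000_000_000   # one billion possible entities, as in the example above
target = 0.99       # desired coverage fraction

def coverage(n, N):
    # coverage fraction f = 1 - exp(-n/N) after collecting n records at random
    return 1 - math.exp(-n / N)

# invert the formula: n = -N * ln(1 - f)
n_needed = -N * math.log(1 - target)
print(f"records needed: {n_needed:.3g}")                  # about 4.6e+09
print(f"check: coverage = {coverage(n_needed, N):.3f}")   # 0.990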

I haven't gotten to the hard part yet. Suppose there are errors in the data records you pull in. Let's call the fraction of records with errors in them epsilon, or ε. Then we get a new formula for the errorless coverage fraction, F:
Errorless coverage fraction, F = [exp(-εn/N) - exp(-n/N)]
This formula behaves very differently from the previous one. Instead of rising asymptotically to one for large n, it rises to a peak, and then drops exponentially to zero for large n. That's right: in the presence of even a small error rate, the more data you pull in, the worse your data gets. There's also a sort of magnification effect on errors- a 1% error rate limits you to about 95% errorless coverage at best; a 0.1% error rate limits you to about 99%.
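The peak is easy to locate: setting the derivative of F with respect to n/N to zero puts it at n/N = ln(1/ε)/(1-ε). Here's a short Python sketch, in the same spirit as the one above, that evaluates the peak for the two error rates just mentioned.

import math

def errorless_coverage(x, eps):
    # F = exp(-eps*x) - exp(-x), where x = n/N and eps is the error rate
    return math.exp(-eps * x) - math.exp(-x)

for eps in (0.01, 0.001):
    x_peak = math.log(1 / eps) / (1 - eps)   # where dF/d(n/N) = 0
    f_max = errorless_coverage(x_peak, eps)
    print(f"error rate {eps:.1%}: peak at n/N = {x_peak:.2f}, "
          f"best errorless coverage = {f_max:.1%}")

# error rate 1.0%: peak at n/N = 4.65, best errorless coverage = 94.5%
# error rate 0.1%: peak at n/N = 6.91, best errorless coverage = 99.2%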

I can think of three strategies to avoid the complete dissipation of knowledgebase value caused by accumulation of errors:
  1. Stop collecting data once you're close to reaching the maximum. This is the strategy of choice for collections of information that are static, or don't change with time.
  2. Spend enough effort detecting and resolving errors to counteract the addition of errors into the collection.
  3. Find ways to eliminate errors in new records.
A strategy I would avoid would be to pretend that perpetual motion machines are possible. There ain't no such thing as a free lunch.