Re: Copernicus, Cataloging, and the Chairs on the Titanic, Part 1 [Long Post]

From: Stephen Paling <paling_at_nyob>
Date: Mon, 5 Jul 2010 09:21:17 -0500
To: NGC4LIB_at_LISTSERV.ND.EDU

Jim Weinheimer wrote:

> I feel like giving a more conservative view:
> 
> One of the main points that is simply taken for granted is that it is 
> simply not possible to "catalog" the worthwhile materials on the 
> Internet. There is too much and it is beyond everyone's capabilities. I 
> have never seen any research attached to this assertion but I have 
> never really believed it. 

Actually, there has been research, and the picture isn't pretty for people who advocate cataloging even significant portions of the Internet. And by the way, that's not a straw man. I still encounter people who want to “catalog the Web.”

Berkeley's SIMS has done several estimates of the amount of information now available. The latest estimate is for 2003, and you can find it at http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/. Let's use their estimate in a little thought experiment. I'll use the executive summary, available at http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm.

You raise a good point that there is a lot of information (babies burping!) that we don't need to worry about. So let's make some generous adjustments in favor of print. Let's take the upper estimate for print (Section III, Table 1.2), which is 1,634 TB worth of information. Let's also assume for the purposes of the game that we want to organize all of that information, including archiving all of the office documents. Let's also assume that film and optical media fall under BBLOs (Books and Book-Like Objects) and are already being cataloged. We'll only include magnetic media in our digital total (Section III.C, Table 1.6), and use the lower bound estimate (3,416,230 TB). Let's be even more generous and lop that number in half to 1,708,115 TB. Those are very generous adjustments for our thought experiment. So what does it leave us with? Dividing 1,708,115 TB by 1,634 TB gives a ratio of digital to paper-based information of about 1,045:1. Just for the heck of it, let's knock that ratio down by another order of magnitude to 104.5:1.
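
For anyone who wants to check the arithmetic, here is a rough sketch in Python. The two terabyte figures are the ones quoted above from the SIMS executive summary; the function name and its parameters are made up purely for illustration.

```python
# Figures as quoted in this post from the SIMS 2003 executive summary.
PRINT_TB_UPPER = 1_634          # upper estimate for print
MAGNETIC_TB_LOWER = 3_416_230   # lower-bound estimate for magnetic media


def digital_to_print_ratio(magnetic_tb, print_tb, halve=True, discount=1.0):
    """Hypothetical helper: ratio of digital to print information.

    halve    -- lop the magnetic total in half, as in the thought experiment
    discount -- extra factor to divide the ratio by (10 = one order of magnitude)
    """
    digital_tb = magnetic_tb / 2 if halve else magnetic_tb
    return digital_tb / print_tb / discount


ratio = digital_to_print_ratio(MAGNETIC_TB_LOWER, PRINT_TB_UPPER)
reduced = digital_to_print_ratio(MAGNETIC_TB_LOWER, PRINT_TB_UPPER, discount=10)

print(f"{ratio:,.0f}:1")    # 1,045:1
print(f"{reduced:.1f}:1")   # 104.5:1
```

Changing `halve` or `discount` shows how much the conclusion depends on each of the adjustments.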

So, even with these ~very~ generous adjustments, we're still faced with a better-than-hundredfold increase in our workload. That's not 100%, that's 100-fold. For every book we catalog now, we would have to catalog an additional 104 items. If a cataloger currently catalogs a book every 15 minutes, give that cataloger some coffee, because she or he will have to catalog an item every 8.6 seconds. Holy hot sauce! Raise your hand if your cataloging department can absorb that extra workload.
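
The 8.6-second figure can be reproduced with a quick sketch like the one below. It assumes the 15 minutes is simply spread over 104.5 items; the helper name is hypothetical. If you count the original book on top of the additional items, the number drops slightly.

```python
MINUTES_PER_BOOK = 15   # current cataloging pace assumed in the post
RATIO = 104.5           # reduced digital-to-print ratio from the thought experiment


def seconds_per_item(minutes_per_book, items):
    """Hypothetical helper: seconds available per item if the time
    now spent on one book has to cover `items` items instead."""
    return minutes_per_book * 60 / items


# Time per item if 15 minutes must cover 104.5 items.
additional_only = seconds_per_item(MINUTES_PER_BOOK, RATIO)

# Variant (an assumption, not the post's figure): also count the original book.
including_book = seconds_per_item(MINUTES_PER_BOOK, RATIO + 1)

print(round(additional_only, 1))   # 8.6
print(round(including_book, 1))    # 8.5
```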

But wait! There's more! We need to think about unitization. A book is a very large chunk of information. Most Web pages take up considerably less space on storage media than a book. You also need to decide at what level you want to aggregate Web content: the site, the page, the file (.htm, .gif, etc.). Does anyone still think this is a tractable problem?

This was just a thought experiment, and real mileage would obviously vary. But remember, it could vary downward ~or~ upward. When people like me say that the information solar system doesn't revolve around books, this is the kind of thing we mean.

Steve

=====================================
Stephen Paling
Assistant Professor
School of Library and Information Studies
4251 Helen C. White Hall
600 N. Park St.
Madison, WI 53706-1403
Phone: (608) 263-2944
Fax: (608) 263-4849
paling_at_wisc.edu