Re: Google Magicians?

From: B.G. Sloan <bgsloan2_at_nyob>
Date: Mon, 21 Sep 2009 12:25:06 -0700
To: NGC4LIB_at_LISTSERV.ND.EDU

Trish Culkin said: "...we'd be better served by continuing the pressure on Google to 1) understand it and then 2) use it."

So who's pressuring Google to do this?

Bernie Sloan

--- On Mon, 9/21/09, Trish Culkin <trish.culkin_at_GMAIL.COM> wrote:

From: Trish Culkin <trish.culkin_at_GMAIL.COM>
Subject: Re: [NGC4LIB] Google Magicians?
To: NGC4LIB_at_LISTSERV.ND.EDU
Date: Monday, September 21, 2009, 2:25 PM

Tarring the data "trapped" in the MARC format seems an oversimplification.
I know the format is a bear, and that not all MARC records are created
equal, but in general MARC is literally a resource without peer.

Setting aside questions of classification and LCSH vs BISAC, it's hard to
argue that "juried" MARC records -- those coming from LC, OCLC and major
academic and public libraries -- do not in general contain good descriptive
cataloging -- i.e. accurate representation of authorship, place and date of
publication, edition, language, etc.  These descriptive facets were the
first focus of Geoffrey Nunberg's Google slam (e.g. all the bad dates) and
it seems counterproductive to argue that these good MARC records are not
worth matching correctly to Google digital editions.

For the record, I also believe that computer manipulation of the
classification embedded in MARC (both Dewey and LC) and of LC Name and
Subject Headings, combined with a good authority file, would add value to
both casual search and retrieval as well as to rigorous scholarly work.
Combined with tagging, text analysis, user-participatory efforts and great
graphics, the potential for using computers to help the world understand its
intellectual heritage is tremendous.

Bottom line, the information contained in MARC records from established
sources represents a 200-year heritage of good-faith professional effort to
describe intellectual works and place them in intellectual context. It's not
only the best we have, it's all we have, and rather than discard it in the
hopes that "better" data will emanate from somewhere, we'd be better served
by continuing the pressure on Google to 1) understand it and then 2) use it.

On Mon, Sep 21, 2009 at 10:34 AM, Jonathan Rochkind <rochkind_at_jhu.edu> wrote:

> I completely understand the power of good metadata.  I know a decent (just
> decent, admittedly) amount about MARC and AACR2 due to excellent preparation
> in library school in classes from Alysson Carlyle, and a three-year career
> of spending significant time talking to catalogers, reading about
> cataloging, working with MARC and AACR2 data, and reading cataloging
> standards. Sure, I don't know as much as an expert cataloger with 20 years'
> experience, but I'm not a babe in the woods.
>
> I still find it very difficult to get all but the most trivial data out of
> our _actual_ in practice MARC corpuses, in a way that will actually be
> consistent and useful to the users.
> I know dozens of people who agree with me, including catalogers, catalogers
> with decades of experience (talk to Diane Hillman, I don't think anyone can
> say she doesn't understand cataloging or respect good metadata), and around
> 10 people who have posted to this list.   Certainly reasonable people can
> disagree though, sure.
>
> I resent this being portrayed as a debate between those who understand the
> power of good metadata and those who don't. I understand the power of good
> metadata, I just wish we had more of it.
>
> Jonathan
>
>
> Trish Culkin wrote:
>
>> I think it *IS* more difficult than it should be, and hence more expensive,
>> to convince system designers and software engineers to work with the
>> intricacies and embedded intelligence of AACR2/MARC metadata.  In over 25
>> years of managing crews of developers in two different ILS companies, I
>> found that their tendency was always to "rethink" or "reinvent", or at
>> least "simplify", the application and use of MARC data, and this is likely
>> true at Google today.
>>
>> This was probably originally an off-shoot of the "not invented here"
>> syndrome, but now I think it's more a matter of AACR2/MARC's complexity
>> not being transparent and not easily succumbing to manipulation by
>> standard tools. Developers typically expect the data to fit into more
>> traditional (and simpler) data-models, and it's hard to entice them (or
>> their business managers) into deconstructing another universe prior to
>> writing new applications.
>>
>> This is notwithstanding Jane's description of currently available options
>> for manipulating data -- the use and value are obvious to those in the
>> library trade, but not so much outside this venue, and it kind of makes
>> her Catch-22 point: "... those who have cataloging/bibliographic knowledge
>> lack computing knowledge/server space. Those who have computing
>> knowledge/server space probably lack cataloging/bibliographic knowledge."
>>
>> If the objective is to use this data to its fullest potential, and if past
>> experience is any indicator, it will require a mix of pressure from
>> skilled users, informed persistence from inside and outside Google to
>> counter profit objectives, and many iterations to achieve something
>> approximating responsible use.
>>
>> I'm not sure whether it's sad or validating to watch this struggle between
>> those who understand the power of good metadata and those who have the
>> skills to make best use of it. Both, I guess.
>>
>>
>> On Mon, Sep 21, 2009 at 9:39 AM, Jacobs, Jane W
>> <Jane.W.Jacobs_at_queenslibrary.org> wrote:
>>
>>
>>
>>> Jonathan Rochkind Wrote:
>>>
>>>
>>>
>>>> All I can say is that I and every other programmer in libraries that I
>>>> know that has tried to work with AACR2/MARC metadata has found that it
>>>> is not nearly as simple as you say to identify data elements of
>>>> interest.   Despite our familiarity with the relevant standards, such as
>>>> they are.
>>>
>>> ...
>>>
>>>
>>>
>>>> All I can
>>>> say is the only people I know that think "it should be easy to get
>>>> whatever data you want out of library MARC" are people who aren't
>>>> programmers who have tried.
>>>
>>> I'm not much of a programmer, but using the open-source Perl module,
>>> developed by REAL programmers (really GOOD programmers, I would add),
>>> I've managed to pull out pretty much everything I needed.  On the
>>> rare occasions when we needed and were able to hire a real programmer
>>> the results were excellent.
>>>
>>> If I were a real programmer and didn't want to dip into the Perl module
>>> to grab what I wanted, I would probably want to use XML; there are
>>> already programs to convert MARC to MARC-XML.  MARC-XML is pretty
>>> verbose and kludgey in terms of taking up space on your servers, but if
>>> you have plenty of server space to stash it on it's no problem.  Grabbing
>>> things out of XML, even the kludgey MARC kind, is quite easy, as long as
>>> you know where you're grabbing from.
>>>
>>> Ironically those who have cataloging/bibliographic knowledge lack
>>> computing knowledge/server space. Those who have computing
>>> knowledge/server space probably lack cataloging/bibliographic knowledge.
>>> Catch-22!
>>>
>>> However, on the following point I expect you're totally correct!
>>>
>>>
>>>
>>>> Google may have much more resources than any one of our libraries do,
>>>> but they still choose to expend them or not based on cost benefit.  I
>>>> still suspect Google's estimate of the 'cost' is higher than you think
>>>> it is, AND that their estimate of the 'benefit' of using library data is
>>>> lower than you think it is.
>>>
>>> JJ
>>>
>>>
>>> **Views expressed by the author do not necessarily represent those of
>>> the Queens Library.**
>>>
>>> Jane Jacobs
>>> Asst. Coord., Catalog Division
>>> Queens Borough Public Library
>>> 89-11 Merrick Blvd.
>>> Jamaica, NY 11432
>>> tel.: (718) 990-0804
>>> e-mail: Jane.W.Jacobs_at_queenslibrary.org
>>> FAX. (718) 990-8566
>>>
>>
>>
>>
>>
>>
>

-- 
Trish