Re: perl-based federated search

From: Walter Lewis <lewisw_at_nyob>
Date: Wed, 12 May 2004 15:40:40 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

rob caSSon wrote:

>[snip]its very incomplete, but if anyone feels like taking a look, here it is:
>there is a tarball with the cgis, php scripts, and example result lists
>for the dbs
>
>i'm mainly posting this in case someone can figure out how better to
>parse some of the database search result lists (jstor has been
>particularly problematic)....oh, how i long for xml output.....
>
>the current scrapers i've got are ebsco, jstor, dataware (ohio-centric),
>and the III catalog....i'll tackle lexis, and a few others in the near
>future.....
>
>anyway, comments/help appreciated....i'm using perl, lwp,
>html::treebuilder, cgi, and uri, and my perl is rusty at best....
>
I did a couple of proof-of-concept things over the winter using php
and curl, both of which are available on a number of platforms.  I
can't comment on how well they perform relative to the configuration
you're using, but libcurl is a smart, reasonably well-supported
toolset.  In my configuration, I attempted to derive the number of
results from the first screen that came back, using a simple regular
expression.  Failing that, I simply put up a "success" flag, based
largely on the *absence* of the target's variation on the "no records
could be found..." message.
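
By way of illustration, the scrape boils down to something like the
sketch below.  The URL, the hit-count pattern and the "no records"
phrase are all placeholders; every real target needs its own versions
of them.

  <?php
  // Fetch one target's first results screen and try to pull a hit
  // count out of it; fall back to a bare "success" flag.
  $ch = curl_init('http://target.example.edu/search?term=Chicora');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return HTML as a string
  curl_setopt($ch, CURLOPT_TIMEOUT, 20);          // cap the wait per target
  $html = curl_exec($ch);
  curl_close($ch);

  if (preg_match('/(\d+)\s+records?\s+found/i', $html, $m)) {
      $hits = (int) $m[1];   // the first screen gave us a count
  } elseif (!preg_match('/no records could be found/i', $html)) {
      $hits = -1;            // count unknown, but flag "success"
  } else {
      $hits = 0;             // the target's "nothing found" message
  }
  ?>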

In terms of the navigation metaphor [note to those using the archives:
this URL is subject to change]: try
   http://roy.halinet.on.ca/GreatLakes/Search/search.php
A search like "Chicora" returns a meaningful, but not overwhelming,
set of results from a number of targets, including BGSU in Ohio.

When I tackled Ebsco, I ran into issues with site authentication via
cookies that were passed to the search gateway but not on to the
client browser.  Peter Binkley, at the University of Alberta,
recommended a proxy configuration to work around this issue.
Essentially, those connections would have to continue to operate
inside a proxied search-gateway session.
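
On the gateway side, curl's cookie-jar options are at least enough to
hold the target's session cookies between requests.  A rough sketch
follows; the login URL and its parameters are invented for
illustration, and this is not necessarily how Peter's proxy
arrangement works.

  <?php
  // Keep the target's session cookies in a per-session jar on the
  // gateway, so the authenticated session never reaches the patron.
  $jar = tempnam('/tmp', 'gw');

  $ch = curl_init('http://target.example.com/login?user=demo');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // save cookies the target sets
  curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // send them back on later requests
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_exec($ch);

  // Reuse the same handle (and jar) for the search itself.
  curl_setopt($ch, CURLOPT_URL,
              'http://target.example.com/search?term=Chicora');
  $results = curl_exec($ch);
  curl_close($ch);
  ?>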

I don't know how the perl tools stack up in terms of parallel search
streams.  The php/curl combination is purely serial, and the last
targets will time out if there is a tardy responder in the middle of
the serial queue.  Art Rhyno, at the University of Windsor, suggested
that a parallel approach might be possible in a Cocoon environment.
This has the advantage of passing all the inbound HTML pages through
JTidy, giving you the XHTML/XML-compliant input stream you wanted (in
most cases, even when the output from the target was some distance
from compliance).
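
For what it's worth, curl itself ships a "multi" interface (exposed
by newer php builds as the curl_multi_* functions) that could, in
principle, run the targets in parallel so a tardy responder only
costs its own timeout.  A minimal sketch of that idea is below, with
placeholder URLs; I haven't wired it into the searches above.

  <?php
  // Fire all the target searches at once and collect whatever pages
  // come back within each target's own timeout.
  $targets = array(
      'ebsco' => 'http://ebsco.example.com/search?term=Chicora',
      'jstor' => 'http://jstor.example.org/search?term=Chicora',
  );

  $mh = curl_multi_init();
  $handles = array();
  foreach ($targets as $name => $url) {
      $ch = curl_init($url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_TIMEOUT, 20);  // per-target cap, not shared
      curl_multi_add_handle($mh, $ch);
      $handles[$name] = $ch;
  }

  // Drive the transfers until every one has finished or timed out.
  $running = 0;
  do {
      curl_multi_exec($mh, $running);
      usleep(100000);   // give curl a moment between polls
  } while ($running > 0);

  $pages = array();
  foreach ($handles as $name => $ch) {
      $pages[$name] = curl_multi_getcontent($ch); // raw HTML for scraping
      curl_multi_remove_handle($mh, $ch);
      curl_close($ch);
  }
  curl_multi_close($mh);
  ?>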

Walter Lewis
Halton Hills
Received on Wed May 12 2004 - 14:48:47 EDT