Re: perl-based federated search

From: Walter Lewis <lewisw_at_nyob>
Date: Wed, 12 May 2004 15:40:40 -0400
To: CODE4LIB_at_LISTSERV.ND.EDU

rob caSSon wrote:

>[snip]its very incomplete, but if anyone feels like taking a look, here it is:
>there is a tarball with the cgis, php scripts, and example result lists
>for the dbs
>
>i'm mainly posting this in case someone can figure out how better to
>parse some of the database search result lists (jstor has been
>particularly problematic)....oh, how i long for xml output.....
>
>the current scrapers i've got are ebsco, jstor, dataware (ohio-centric),
>and the III catalog....i'll tackle lexis, and a few others in the near
>future.....
>
>anyway, comments/help appreciated....i'm using perl, lwp,
>html::treebuilder, cgi, and uri, and my perl is rusty at best....
>
I did a couple of proof-of-concept things over the winter using php
and curl, both of which are available on a number of platforms.  I
can't comment on how well they perform relative to the configuration
you're using, but libcurl is a smart, reasonably well-supported
toolset.  In my configuration, I attempted to derive the number of
results from the first screen that came back, using a simple regular
expression.  Failing that, I simply put up a "success" flag, based
largely on the *absence* of the target's variation on the "no records
could be found..." message.
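
By way of illustration, the scrape boils down to something like the
sketch below.  The URL, the hit-count pattern and the "no records"
phrase are all placeholders; every real target needs its own versions
of them.

  <?php
  // Fetch one target's first results screen and try to pull a hit
  // count out of it; fall back to a bare "success" flag.
  $ch = curl_init('http://target.example.edu/search?term=Chicora');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return HTML as a string
  curl_setopt($ch, CURLOPT_TIMEOUT, 20);          // cap the wait per target
  $html = curl_exec($ch);
  curl_close($ch);

  if (preg_match('/(\d+)\s+records?\s+found/i', $html, $m)) {
      $hits = (int) $m[1];   // the first screen gave us a count
  } elseif (!preg_match('/no records could be found/i', $html)) {
      $hits = -1;            // count unknown, but flag "success"
  } else {
      $hits = 0;             // the target's "nothing found" message
  }
  ?>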

In terms of the navigation metaphor [note to those using the archives:
this URL is subject to change]: try
   http://roy.halinet.on.ca/GreatLakes/Search/search.php
A search like "Chicora" returns a meaningful, but not overwhelming,
set of results from a number of targets, including BGSU in Ohio.

When I tackled Ebsco, I ran into issues with site authentication via
cookies that were passed to the search gateway but not on to the
client browser.  Peter Binkley, at the University of Alberta,
recommended a proxy configuration to work around this issue.
Essentially, those connections would have to continue to operate
inside a proxied search-gateway session.
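
On the gateway side, curl's cookie-jar options are at least enough to
hold the target's session cookies between requests.  A rough sketch
follows; the login URL and its parameters are invented for
illustration, and this is not necessarily how Peter's proxy
arrangement works.

  <?php
  // Keep the target's session cookies in a per-session jar on the
  // gateway, so the authenticated session never reaches the patron.
  $jar = tempnam('/tmp', 'gw');

  $ch = curl_init('http://target.example.com/login?user=demo');
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_COOKIEJAR, $jar);   // save cookies the target sets
  curl_setopt($ch, CURLOPT_COOKIEFILE, $jar);  // send them back on later requests
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
  curl_exec($ch);

  // Reuse the same handle (and jar) for the search itself.
  curl_setopt($ch, CURLOPT_URL,
              'http://target.example.com/search?term=Chicora');
  $results = curl_exec($ch);
  curl_close($ch);
  ?>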

I don't know how the perl tools stack up in terms of parallel search
streams.  The php/curl combination is purely serial, and the last
targets will time out if there is a tardy responder in the middle of
the serial queue.  Art Rhyno, at the University of Windsor, suggested
that a parallel approach might be possible in a Cocoon environment.
This has the advantage of passing all the inbound HTML pages through
JTidy, giving you the XHTML/XML-compliant input stream you wanted (in
most cases, even when the output from the target was some distance
from compliance).
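
For what it's worth, curl itself ships a "multi" interface (exposed
by newer php builds as the curl_multi_* functions) that could, in
principle, run the targets in parallel so a tardy responder only
costs its own timeout.  A minimal sketch of that idea is below, with
placeholder URLs; I haven't wired it into the searches above.

  <?php
  // Fire all the target searches at once and collect whatever pages
  // come back within each target's own timeout.
  $targets = array(
      'ebsco' => 'http://ebsco.example.com/search?term=Chicora',
      'jstor' => 'http://jstor.example.org/search?term=Chicora',
  );

  $mh = curl_multi_init();
  $handles = array();
  foreach ($targets as $name => $url) {
      $ch = curl_init($url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_TIMEOUT, 20);  // per-target cap, not shared
      curl_multi_add_handle($mh, $ch);
      $handles[$name] = $ch;
  }

  // Drive the transfers until every one has finished or timed out.
  $running = 0;
  do {
      curl_multi_exec($mh, $running);
      usleep(100000);   // give curl a moment between polls
  } while ($running > 0);

  $pages = array();
  foreach ($handles as $name => $ch) {
      $pages[$name] = curl_multi_getcontent($ch); // raw HTML for scraping
      curl_multi_remove_handle($mh, $ch);
      curl_close($ch);
  }
  curl_multi_close($mh);
  ?>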

Walter Lewis
Halton Hills
Received on Wed May 12 2004 - 14:48:47 EDT