Re: [Spam] Re: mailinglist2rss.pl

From: Kevin S. Clarke <ksclarke_at_nyob>
Date: Mon, 9 Feb 2004 12:27:31 -0800
To: CODE4LIB_at_LISTSERV.ND.EDU
On Mon, 2004-02-09 at 11:41, Walter Lewis wrote:

>     One of the issues that I bumped into was that what passes for HTML in
> some email programs is [insert expletive of choice here].  Putting it in
> an XML data store was going to cause tons of validation errors.

Some success might be found with TagSoup:
http://home.ccil.org/~cowan/XML/tagsoup/

It delivers SAX events from less than well-formed HTML.  It doesn't
correct validation or style problems though...  just provides a
consistent, well-formed interface to sloppy HTML.
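
To give a feel for the event-stream idea, here is a minimal sketch.  It is
not TagSoup (that's Java and feeds a SAX ContentHandler); it's a stand-in
using Python's standard html.parser, and EventCollector and parse_events
are names I made up for the illustration.  Unlike TagSoup, this stand-in
does not supply the missing end tags, so the stream it produces is tolerant
but not well-formed.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parse events, loosely like a SAX handler."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only runs; keep trimmed text.
        if data.strip():
            self.events.append(("text", data.strip()))


def parse_events(html):
    """Feed possibly malformed HTML and return the recorded events."""
    collector = EventCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered
    return collector.events


# Unclosed <p>, <b> and <i>: the parser reports what it sees without raising.
sloppy = "<p>Hello <b>world<p>unclosed <i>tags"
events = parse_events(sloppy)
for event in events:
    print(event)
```

The point is only that the consumer sees a uniform stream of start/end/text
events no matter how ragged the input is.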

An alternate approach, JTidy, will do a good job of fixing many
validation problems, but it may fail depending on how bad the HTML is:

http://jtidy.sourceforge.net/

TagSoup doesn't fail... "Just Keep[s] On Truckin'"

--
Kevin S. Clarke <ksclarke_at_stanford.edu>
Lane Medical Library, Stanford University
Received on Mon Feb 09 2004 - 15:38:39 UTC