Re: unwanted (bogus) characters in marc

From: Thomas Krichel <krichel_at_nyob>
Date: Sun, 10 Oct 2010 22:14:00 +0200
To: CODE4LIB_at_LISTSERV.ND.EDU

  stuart yeates writes

> Thomas Krichel wrote:

  ...

> >  It will try to guess between UTF-8 and ISO-8859-1. This can be done
> >  because UTF-8 has many invalid byte sequences.  But say if you
> >  wanted to guess between ISO-8859-1 and ISO-8859-2, you'd be out of
> >  luck.
> 
> Not necessarily.

  I meant you would be out of luck with the tool I proposed. 
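
  To illustrate the kind of guess I mean, here is a minimal Python
  sketch. It is not the tool itself, and the function name is made
  up for this example. It assumes the input is either UTF-8 or
  ISO-8859-1 and nothing else.

```python
def guess_utf8_or_latin1(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1 (hypothetical helper).

    UTF-8 has many invalid byte sequences, so a failed strict decode
    is strong evidence that the data is not UTF-8. ISO-8859-1 assigns
    a character to every byte value, so it can never fail and serves
    as the fallback.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return "iso-8859-1"
    # Pure ASCII also lands here; it is byte-identical in both encodings.
    return "utf-8"


if __name__ == "__main__":
    print(guess_utf8_or_latin1("café".encode("utf-8")))
    print(guess_utf8_or_latin1("café".encode("iso-8859-1")))
```

  The same trick does not help with ISO-8859-1 against ISO-8859-2:
  a byte such as 0xB1 decodes without error in both, just to a
  different character, so there is no invalid sequence to detect.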

> There are tools such as http://www.let.rug.nl/~vannoord/TextCat/
> which provide very reliable guessing of languages.

  I am happy to read this; I have needed language
  detection several times already.

  But detecting the language of a text is a somewhat different
  problem from detecting its character encoding.

  Cheers,

  Thomas Krichel                    http://openlib.org/home/krichel
                                http://authorclaim.org/profile/pkr1
                                               skype: thomaskrichel