Re: Character problems with tictoc

From: Glen Newton <glen.newton_at_nyob>
Date: Mon, 21 Dec 2009 14:09:28 -0500
To: CODE4LIB_at_LISTSERV.ND.EDU
It seems that different people are seeing different things in their
respective viewers (i.e., some are OK and others are like what I am
seeing).

When I use wget and view the local file in Firefox (3.0.4, Linux Suse
11.0) I see:
 http://cuvier.cisti.nrc.ca/~gnewton/tictoc1.gif
[gif used as it is not lossy]

The text is clearly not correct.

The file I got with wget is:
  http://cuvier.cisti.nrc.ca/~gnewton/tictoc.txt

Is this just a question of different client software (and/or OSes)
viewing or mangling the content?
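
One way to take the viewer out of the picture is to test the downloaded
bytes directly. Below is a minimal sketch, assuming Python 3. The
function name is hypothetical, and the sample line is built in the
script rather than read from the real tictoc.txt. It is a heuristic
only: it flags text that is valid UTF-8 and is still valid UTF-8 after
one layer of encoding is peeled off.

```python
def looks_double_encoded(data: bytes) -> bool:
    """Heuristic: True if `data` appears to be UTF-8 that was encoded twice."""
    try:
        once = data.decode("utf-8")
        # Peel one layer: map each character back to a single byte,
        # then see whether those bytes are themselves valid UTF-8.
        peeled = once.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the data is not UTF-8 at all, or peeling a layer fails,
        # which is what happens with correctly encoded non-ASCII text.
        return False
    # Pure ASCII survives the round trip unchanged, so require a difference.
    return peeled != once


# A correctly encoded line, and the same line with a second UTF-8 pass.
good = "221\tActa Ortop\u00e9dica Brasileira".encode("utf-8")
bad = good.decode("latin-1").encode("utf-8")

print(looks_double_encoded(good))  # False
print(looks_double_encoded(bad))   # True
```

If a check like this returns True for the wget copy, the bytes were
already wrong on the wire and the viewer is not to blame.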

-glen

-------------------------------------------------------
Thanks for tracking this down, Godmar.
I've emailed tictocs and we'll see what they say.

-Glen :-)


------------------------------------------------------------------
From:         Godmar Back <godmar_at_GMAIL.COM>
Sender:       Code for Libraries <CODE4LIB_at_LISTSERV.ND.EDU>
To:           CODE4LIB_at_LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Character problems with tictoc
Date:         Mon, 21 Dec 2009 13:20:08 -0500
Message-ID:  <719dced30912211020y7b726c83jc54d0fadcba92827_at_mail.gmail.com>

The string in question is double-encoded, that is, a string that's in
UTF-8 already was run through a UTF-8 encoder.

The string is "Acta Ortopedica" where the 'e' is really '\u00e9' aka
'Latin Small Letter E with Acute'. [1]

In UTF-8, the e-acute is two-byte encoded as C3 A9.  If you run the
bytes C3 A9 through a UTF-8 encoder, C3 ('\u00c3' - Capital A with
tilde) becomes C3 83 and A9 (copyright sign, '\u00a9') becomes C2 A9.
C3 83 C2 A9 is exactly what JISC is serving; what it should be serving
is C3 A9.
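
The byte arithmetic above can be reproduced in a few lines. This is a
sketch, assuming Python 3, and assuming the second pass treated each
byte as one Latin-1 character (which matches the C3 -> '\u00c3' and
A9 -> '\u00a9' mapping described here). How the server actually
produced the bytes is not known. hex_bytes is a helper made up for the
example.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex, e.g. 'C3 A9'."""
    return " ".join(f"{b:02X}" for b in data)


correct = "Acta Ortop\u00e9dica"

# First (correct) pass: e-acute becomes the two bytes C3 A9.
e_once = "\u00e9".encode("utf-8")

# Second (erroneous) pass: each byte is read as a character and encoded again.
e_twice = e_once.decode("latin-1").encode("utf-8")

print(hex_bytes(e_once))   # C3 A9
print(hex_bytes(e_twice))  # C3 83 C2 A9

# The same mistake applied to the whole title, then reversed.
served = correct.encode("utf-8").decode("latin-1").encode("utf-8")
repaired = served.decode("utf-8").encode("latin-1").decode("utf-8")

print(repaired == correct)  # True
```

The reversal only works if the second pass really was byte-for-byte
Latin-1. If something like Windows-1252 was involved instead, bytes in
the 80-9F range would map differently and the repair would need to
change accordingly.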

Send email to them.

 - Godmar

[1] http://www.utf8-chartable.de/

2009/12/21 Glen Newton <glen.newton_at_nrc-cnrc.gc.ca>
>
> [I realise there was a related 'Character-sets for dummies'[1]
> discussion recently]
>
> I am using tictocs[2] list of journal RSS feeds, and I am getting
> gibberish in places for diacritics. Below is an example:
>
> in emacs:
>  221    Acta Ortop  dica Brasileira     http://www.scielo.br/rss.php?pid=1413-7852&lang=en      1413-7852
> in Firefox:
>  221    Acta Ortop  dica Brasileira     http://www.scielo.br/rss.php?pid=1413-7852&lang=en      1413-7852
>
> Note that the emacs view is the same for both a file saved from
> Firefox and a direct download using 'wget'.
>
> Is this something on my end, or are the tictocs people not serving
> proper UTF-8?
>
> The HTTP header from wget claims UTF-8:
> > wget -S http://www.tictocs.ac.uk/text.php
> > --2009-12-21 12:47:59--  http://www.tictocs.ac.uk/text.php
> > Resolving www.tictocs.ac.uk... 130.88.101.131
> > Connecting to www.tictocs.ac.uk|130.88.101.131|:80... connected.
> > HTTP request sent, awaiting response...
> >   HTTP/1.1 200 OK
> >   Date: Mon, 21 Dec 2009 17:42:05 GMT
> >   Server: Apache/2.2.13 (Unix) mod_ssl/2.2.13 OpenSSL/0.9.8k PHP/5.3.0 DAV/2
> >   X-Powered-By: PHP/5.3.0
> >   Content-Type: text/plain; charset=utf-8
> >   Connection: close
> > Length: unspecified [text/plain]
> ><....stuff removed>
>
> Can someone validate if they are also experiencing this issue?
>
> Thanks,
> Glen
>
> [1]https://listserv.nd.edu/cgi-bin/wa?S2=CODE4LIB&q=&s=character-sets+for+dummies&f=&a=&b=
> [2]http://www.tictocs.ac.uk/text.php
>
> --
> Glen Newton | glen.newton_at_nrc-cnrc.gc.ca
> Researcher, Information Science, CISTI Research
> & NRC W3C Advisory Committee Representative
> http://tinyurl.com/yvchmu
> tel/tél: 613-990-9163 | facsimile/télécopieur 613-952-8246
> Canada Institute for Scientific and Technical Information (CISTI)
> National Research Council Canada (NRC)| M-55, 1200 Montreal Road
> http://www.nrc-cnrc.gc.ca/
> Institut canadien de l'information scientifique et technique (ICIST)
> Conseil national de recherches Canada | M-55, 1200 chemin Montréal
> Ottawa, Ontario K1A 0R6
> Government of Canada | Gouvernement du Canada
> --
Received on Mon Dec 21 2009 - 14:05:29 EST