utf-8, probable solution

Paul POULAIN-2
Thanks to Heikki Levanto, Tümer Garip & Mike Rylander: you each pointed
out something that was useless alone, but very useful when combined.

I think I have the solution to our problem. It's not a Zebra,
HTML::Template or MARC::Record problem, it's a Perl one!

Let me explain:
I followed my UTF-8 string through my Perl code until it was printed, and
it was always UTF-8 (\x9c...).
But in Firefox, it was ISO-8859-1.

Heikki told me that the first 256 characters are shared by Unicode and
ISO-8859-1. So I told myself: OK, Paul, add a "true UTF-8" character to
your string. I chose \x{263a} (the smiley, because I'm always
optimistic, and it is the one used in perluniintro).

Surprise... now my é was a UTF-8 é in Firefox!
Conclusion: Perl looked at my string before sending it and, as it found
that it was not "true UTF-8", converted it to ISO-8859-1.
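
A minimal sketch of this behaviour (not code from Koha): it prints a string
to an in-memory filehandle that has no encoding layer and shows the bytes
that come out. The helper name raw_output_hex is made up for this
illustration, and the "Wide character" warning is silenced only to keep the
demo output clean.

```perl
use strict;
use warnings;

# Print a string to an in-memory filehandle with no encoding layer
# and return the bytes that were written, as a hex string.
sub raw_output_hex {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';    # hide "Wide character in print" for this demo
        print {$fh} $string;
    }
    close($fh);
    return unpack('H*', $buffer);
}

my $e_acute_only = "\x{e9}";            # é alone: every character fits in one byte
my $with_smiley  = "\x{e9}\x{263a}";    # é plus a character above 0xFF

print raw_output_hex($e_acute_only), "\n";    # e9          (ISO-8859-1 byte)
print raw_output_hex($with_smiley),  "\n";    # c3a9e298ba  (UTF-8 bytes)
```

Without the smiley, the é leaves Perl as the single byte e9; with it, the
whole string is written as UTF-8, which matches what Firefox displayed.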

I also had a brand new message in my log :
 >            Wide character in print at ...

Mike R.'s and Tümer G.'s suggestions made me investigate the perldoc on
Unicode, and here it is:

> A user of Perl does not normally need to know nor care how Perl
> happens to encode its internal strings, but it becomes relevant when
> outputting Unicode strings to a stream without a PerlIO layer -- one
> with the "default" encoding.  In such a case, the raw bytes used
> internally (the native character set or UTF-8, as appropriate for
> each string) will be used, and a "Wide character" warning will be
> issued if those strings contain a character beyond 0x00FF.
> For example,
>       perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'
> produces a fairly useless mixture of native bytes and UTF-8, as well
> as a warning:
>       Wide character in print at ...
> To output UTF-8, use the ":utf8" output layer.  Prepending
>       binmode(STDOUT, ":utf8");
> to this sample program ensures that the output is completely UTF-8,
> and removes the program's warning.


GOTCHA! I have added binmode(STDOUT, ":utf8") and now, even without
the smiley, my éà... are shown correctly.
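
The effect of the layer can be sketched like this, again with an in-memory
filehandle standing in for STDOUT (the helper name utf8_output_hex is
invented for the example; in a real script the call would simply be
binmode(STDOUT, ":utf8") before the first print):

```perl
use strict;
use warnings;

# Print a string through a handle that has the ":utf8" layer applied
# and return the bytes that were written, as a hex string.
sub utf8_output_hex {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    binmode($fh, ':utf8');     # same layer as binmode(STDOUT, ":utf8")
    print {$fh} $string;
    close($fh);
    return unpack('H*', $buffer);
}

print utf8_output_hex("\x{e9}"), "\n";            # c3a9
print utf8_output_hex("\x{e9}\x{263a}"), "\n";    # c3a9e298ba
```

With the layer in place, é is written as the two UTF-8 bytes c3 a9 whether
or not the string contains a character above 0xFF, and no warning is issued.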

I still have to investigate MySQL UTF-8, but it seems that
 > set NAMES=utf8
is useless.

Thanks, everybody, for helping me. I'll continue this thread on koha-devel
only, as the Zebra and perl4lib lists are probably not interested.
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)


_______________________________________________
Koha-devel mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/koha-devel

Re: [Zebralist] utf-8, probable solution

Sebastian Hammer
Paul,

this confirms our impressions at Index Data. Somehow, while PHP has
managed to approach Unicode in a way that mostly 'just works' (probably
by not doing more than necessary to it), Perl seems to have all kinds of
internal logic that makes Unicode really, really complicated and
unintuitive. We had two guys spend a week or so each trying to make
heads or tails of the UTF-8 tutorial, and still we felt at the end like
we were fudging around the problem rather than really solving it well.

I'm *not* fond of Perl's approach to Unicode.

--Sebastian

--
Sebastian Hammer, Index Data
[hidden email]   www.indexdata.com
Ph: (603) 209-6853





Re: Re: [Zebralist] utf-8, probable solution

Paul POULAIN-2
Sebastian Hammer wrote:

> Paul,
>
> this confirms our impressions at Index Data.. somehow, while PHP has
> managed to approach Unicode in a way that mostly 'just works' (probably
> by not doing more than necessary to it), Perl seems to have all kinds of
> internal logic which has the effect of making  Unicode really, really
> complicated and unintuitive. We had two guys spending a week or so each
> trying to make heads or tails of the UTF-8 tutorial, and still we felt
> at the end like we were fudging around the problem rather than really
> solving it well.
>
> I'm *not* fond of Perl's approach to Unicode.

I spoke with a French guy who is heavily involved in Perl (Paul Gaborit).
It seems that Unicode handling in Perl 5.8.x had one major goal:
change nothing for existing scripts, which must keep working in 5.8 just
as they did in 5.6.
That's why Perl tries to guess what you want to do and, if you don't
force it to speak UTF-8, falls back to something more compatible
with the "old world".
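
A small sketch of what this fallback looks like in practice, as I
understand it (the helper raw_bytes_hex is hypothetical, written only for
this illustration). Even when a string is forced into UTF-8 storage
internally, Perl still writes it as single bytes to a handle without a
layer, as long as every character fits below 0x100:

```perl
use strict;
use warnings;

# Return the bytes written when $string goes to a handle without any layer.
sub raw_bytes_hex {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh);
    return unpack('H*', $buffer);
}

my $e_acute = "\x{e9}";
my $flag_before = utf8::is_utf8($e_acute) ? 1 : 0;    # stored as one native byte

utf8::upgrade($e_acute);                               # force UTF-8 storage internally
my $flag_after = utf8::is_utf8($e_acute) ? 1 : 0;

print "flag before: $flag_before, after: $flag_after\n";    # flag before: 0, after: 1
print "length: ", length($e_acute), "\n";                   # length: 1
print "bytes on output: ", raw_bytes_hex($e_acute), "\n";   # bytes on output: e9
```

The internal flag changes, but the string is still one character long and
still leaves Perl as the byte e9, so only an explicit output layer (or an
explicit encode) makes the output UTF-8.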

Not a bad solution, I think, except that it is a hard one to find.

Another problem we still seem to have is that DBD::mysql seems to
handle UTF-8 MySQL data poorly :-(
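
One possible workaround, sketched under the assumption that the driver hands
back the raw UTF-8 bytes without marking them as characters (I have not
verified that this is what happens in Koha): decode the value by hand with
the core Encode module. The variable $from_db is a stand-in for a fetched
column value, not a real DBI call.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for a column value fetched through DBI (an assumption for this
# sketch): the two UTF-8 bytes that encode "é".
my $from_db    = "\xc3\xa9";
my $byte_count = length($from_db);

# Turn the UTF-8 bytes into a Perl character string.
my $text = decode('UTF-8', $from_db);

printf "bytes: %d, characters: %d, code point: U+%04X\n",
    $byte_count, length($text), ord($text);
# bytes: 2, characters: 1, code point: U+00E9
```

Once decoded, the value behaves like any other character string and can be
printed through a handle with the ":utf8" layer.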

--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)

