Re: Unimarc, marc21, Unicode, and MARC::File::XML

Re: Unimarc, marc21, Unicode, and MARC::File::XML

Paul POULAIN-2
Mike Rylander wrote:
> I tested with the record you sent Ed and me, and everything seems to
> work for me ...
> As you can see, I tested several variants of the UNIMARC flag, and
> even tested not sending the encoding to new_from_xml() ... it all
> seems to work for me, and I'm not sure what problems you're seeing.
> Perhaps you just needed to set your binmode for the XML source?

Strange, strange...

What my script does:
* retrieves the MARC::Record from Zebra
* reads some data from MySQL
* builds a page with HTML::Template
* sends the page to the browser

I added 3 lines to save the record to a file after reading it from Zebra.
Adding binmode(F,':utf8');
before saving my record to F gives me correct UTF-8.
Without binmode, it is not OK.
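
A minimal sketch of the effect described above, using only core Perl. An in-memory filehandle stands in for the file F, and a one-character string stands in for a field of the record; both are assumptions for illustration, not the actual script.

```perl
use strict;
use warnings;

# Write a string to an in-memory filehandle, optionally through a PerlIO
# layer, and return the bytes that ended up in the "file" as lowercase hex.
sub bytes_written {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory file: $!";
    binmode($fh, $layer) if defined $layer;
    print {$fh} $string;
    close($fh);
    return unpack('H*', $buffer);
}

# "\x{e9}" is the single character e-acute as Perl holds it internally.
my $field = "\x{e9}";

# No layer: Perl writes the Latin-1 byte e9, which is not valid UTF-8.
print 'no binmode: ', bytes_written($field), "\n";

# With the :utf8 layer the character is encoded as the two bytes c3 a9.
print ':utf8     : ', bytes_written($field, ':utf8'), "\n";
```

The same character produces different bytes depending on the layer, which would explain why the saved file only looks right once binmode is set.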

But when I put the MARC::Record in a page built with HTML::Template,
it's wrong.
The HTML is UTF-8 (html page encoding).
It also contains some strings from MySQL, and all the strings from MySQL
appear as correct UTF-8, while none of the strings from the MARC::Record
coming from Zebra do!

I can add "binmode()" to the template output, but then everything goes wrong
with the strings from MySQL.

Any suggestions welcome!
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)


_______________________________________________
Koha-devel mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/koha-devel

Re: Unimarc, marc21, Unicode, and MARC::File::XML

Mike Rylander
On 3/20/06, Paul POULAIN <[hidden email]> wrote:

> Mike Rylander wrote:
> > I tested with the record you sent Ed and me, and everything seems to
> > work for me ...
> > As you can see, I tested several variants of the UNIMARC flag, and
> > even tested not sending the encoding to new_from_xml() ... it all
> > seems to work for me, and I'm not sure what problems you're seeing.
> > Perhaps you just needed to set your binmode for the XML source?
>
> Strange, strange...
>
> What my script does:
> * retrieves the MARC::Record from Zebra
> * reads some data from MySQL
> * builds a page with HTML::Template
> * sends the page to the browser

Are you getting XML or binary MARC from zebra?

>
> I added 3 lines to save the record to a file after reading it from Zebra.
> Adding binmode(F,':utf8');
> before saving my record to F gives me correct UTF-8.
> Without binmode, it is not OK.
>
> But when I put the MARC::Record in a page built with HTML::Template,
> it's wrong.
> The HTML is UTF-8 (html page encoding).
> It also contains some strings from MySQL, and all the strings from MySQL
> appear as correct UTF-8, while none of the strings from the MARC::Record
> coming from Zebra do!
>
> I can add "binmode()" to the template output, but then everything goes wrong
> with the strings from MySQL.
>

Are you using decode_utf8($mysql_string) to let Perl know that the
database is UTF-8 encoded?  IIRC, MySQL doesn't know how to tell Perl
about that, and the DBD::mysql maintainers haven't added that
functionality to the module yet.

> Any suggestions welcome!
> --
> Paul POULAIN et Henri Damien LAURENT
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
>


--
Mike Rylander
[hidden email]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

Pierrick LE GALL
Hello Mike,

I'll answer the second question, since I worked with Paul on
Perl/MySQL and UTF-8...

On Mon, 20 Mar 2006 09:59:32 -0500
"Mike Rylander" <[hidden email]> wrote:

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF-8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::mysql maintainers haven't added that
> functionality to the module yet.

We don't use decode_utf8. Right after creating the database handle, we
force the connection to use UTF-8 with a "set names 'UTF8'" SQL query. Since
we know our data is stored as UTF-8 and we want UTF-8 out, everything works fine.
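
For reference, that setup might look like the sketch below. The helper name `force_utf8_connection` and the connection parameters in the usage comment are hypothetical; the only call made on the handle is DBI's standard `do` method.

```perl
use strict;
use warnings;

# Ask the MySQL server to exchange UTF-8 with this connection.
# $dbh is expected to be a connected DBI database handle.
sub force_utf8_connection {
    my ($dbh) = @_;
    $dbh->do("SET NAMES 'utf8'");
    return $dbh;
}

# Typical use, right after connecting (hypothetical DSN and credentials):
#
#   use DBI;
#   my $dbh = DBI->connect('DBI:mysql:database=koha', $user, $password,
#                          { RaiseError => 1 });
#   force_utf8_connection($dbh);
```

Note that this statement only sets the character set used between client and server; it does not by itself mark the strings Perl receives as UTF-8 text.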

Bye

--
Pierrick LE GALL
INEO media system



Re: Unimarc, marc21, Unicode, and MARC::File::XML

Paul POULAIN-2
In reply to this post by Mike Rylander
Mike Rylander wrote:

> On 3/20/06, Paul POULAIN <[hidden email]> wrote:
>
>>Mike Rylander wrote:
>>
>>>I tested with the record you sent Ed and me, and everything seems to
>>>work for me ...
>>>As you can see, I tested several variants of the UNIMARC flag, and
>>>even tested not sending the encoding to new_from_xml() ... it all
>>>seems to work for me, and I'm not sure what problems you're seeing.
>>>Perhaps you just needed to set your binmode for the XML source?
>>
>>Strange, strange...
>>
>>What my script does:
>>* retrieves the MARC::Record from Zebra
>>* reads some data from MySQL
>>* builds a page with HTML::Template
>>* sends the page to the browser
> Are you getting XML or binary MARC from zebra?

XML. The test.xml I sent you on Friday was the record returned by
             $raw = $rs->record(0)->raw();

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF-8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::mysql maintainers haven't added that
> functionality to the module yet.

I thought we had to decode_utf8($mysql_string), and began to investigate
at length. But after many hours of digging and running into problems, I now
have MySQL working in UTF-8 for all of Koha,
without any binmode or decode_utf8 ...
And it seems Joshua & Tümer (Turkey) have reached the same conclusion: no more
problems with MySQL & Perl.
We all use a recent version of MySQL, even though the DBD::mysql maintainer
(from mysql.com: Joshua sent him a mail but got no answer) has changed
nothing in the CPAN package.

--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)



Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

Mike Rylander
In reply to this post by Pierrick LE GALL
On 3/20/06, Pierrick LE GALL <[hidden email]> wrote:

> Hello Mike,
>
> I'll answer the second question, since I worked with Paul on
> Perl/MySQL and UTF-8...
>
> On Mon, 20 Mar 2006 09:59:32 -0500
> "Mike Rylander" <[hidden email]> wrote:
>
> > Are you using decode_utf8($mysql_string) to let Perl know that the
> > database is UTF-8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> > about that, and the DBD::mysql maintainers haven't added that
> > functionality to the module yet.
>
> We don't use decode_utf8. Right after creating the database handle, we
> force the connection to use UTF-8 with a "set names 'UTF8'" SQL query. Since
> we know our data is stored as UTF-8 and we want UTF-8 out, everything works fine.
>

Except that Perl doesn't know that the data is already UTF8 ... which
is the problem.  Perl /does/ know that the MARC data is UTF8, and it
has to convert one string or the other on output.  If you explicitly
use binmode() to set the PerlIO state to utf8, then the MARC::Record
strings, which are known good UTF8, are not transformed, but the MySQL
data, of which Perl has no encoding notions, gets "transformed", and
thus broken.
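
The mismatch described above can be reproduced with core Perl alone. In this sketch the two variables are stand-ins (assumptions, not Koha code): `$marc_field` is a decoded character string, as MARC::Record would hand back, and `$mysql_field` holds the same text as raw UTF-8 bytes that Perl has not been told about.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Print a string through an optional PerlIO layer into an in-memory
# filehandle and return the resulting bytes as lowercase hex.
sub output_hex {
    my ($string, $layer) = @_;
    my $buffer = '';
    open(my $fh, '>', \$buffer) or die "cannot open in-memory file: $!";
    binmode($fh, $layer) if defined $layer;
    print {$fh} $string;
    close($fh);
    return unpack('H*', $buffer);
}

my $mysql_field = "\xC3\xA9";                    # undecoded: two bytes
my $marc_field  = decode('UTF-8', "\xC3\xA9");   # decoded: one character

# Correct UTF-8 output for e-acute is "c3a9" in every case.
print 'MARC,  no layer: ', output_hex($marc_field),           "\n"; # e9: Latin-1, broken
print 'MySQL, no layer: ', output_hex($mysql_field),          "\n"; # c3a9: fine
print 'MARC,  :utf8   : ', output_hex($marc_field,  ':utf8'), "\n"; # c3a9: fine
print 'MySQL, :utf8   : ', output_hex($mysql_field, ':utf8'), "\n"; # c383c2a9: encoded twice
```

Whichever way the output handle is set, one of the two sources comes out wrong, which matches the symptoms reported earlier in the thread.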

The only consistent and correct way to deal with UTF8 data in Perl is
to let PerlIO handle it by marking all sources as either providing
UTF8 data or not.  You can do that with binmode(), open() and in several
other ways, including the open pragma in modern Perls (
http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm ).  Because
DBD::mysql doesn't give you a way to mark its socket as UTF8, you need
to be a little underhanded and tell Perl as soon as possible using
decode(), or by making utf8 the default mode for all PerlIO channels.
There really isn't any way around this if you want to claim real UTF8
support and be able to use components that really do support UTF8
natively, like MARC::File::XML and MARC::Record.
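
A minimal sketch of the decode() approach, assuming the bytes returned by the database really are UTF-8. `$mysql_bytes` is a stand-in for a column value fetched through DBD::mysql, and `$marc_field` for a string taken from a MARC::Record; neither is real Koha code.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for a MARC::Record string: already a decoded character string.
my $marc_field = "\x{e9}t\x{e9}";          # three characters

# Stand-in for a value fetched from MySQL: the same text as raw UTF-8 bytes.
my $mysql_bytes = "\xC3\xA9t\xC3\xA9";     # five bytes

# Decode as early as possible, right after the fetch, so Perl knows
# this is text and not arbitrary bytes.
my $mysql_field = decode('UTF-8', $mysql_bytes);

# Both values are now character strings and can be compared, measured
# and concatenated safely.
print "same text\n"    if $marc_field eq $mysql_field;
print "3 characters\n" if length($mysql_field) == 3;

# Encode once, at output time. On a real filehandle,
# binmode(STDOUT, ':utf8') performs this step for every print.
my $page = encode('UTF-8', "$marc_field / $mysql_field");
print $page, "\n";
```

Once every source is decoded on the way in, a single encoding step on the way out is correct for all of them.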

It's unfortunate that the DBD::mysql people won't fix their module,
but there really is a right way to do this, even without their help.
Is there a performance penalty with decode()?  Yep.  Would that go
away with a fix to the DBD::mysql module?  Mostly, so you really need
to bug them.

> Bye
>
> --
> Pierrick LE GALL
> INEO media system
>


--
Mike Rylander
[hidden email]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and MARC::File::XML

Pierrick LE GALL
On Mon, 20 Mar 2006 10:54:08 -0500
"Mike Rylander" <[hidden email]> wrote:

> Except that Perl doesn't know that the data is already UTF8 ... which
> is the problem. [...]

You're completely right, I understand the difference. We made UTF8 work
from MySQL, but we didn't try to process the data coming from MySQL. Just
"select ..." and "print". So it works, but we are limited when it comes to
string processing.

> It's unfortunate that the DBD::mysql people won't fix their module,
> but there really is a right way to do this, even without their help.
> Is there a performance penalty with decode()?  Yep.  Would that go
> away with a fix to the DBD::mysql module?  Mostly, so you really need
> to bug them.

The problem with decode() is the impact. Adding this step for each
string retrieved from MySQL means touching hundreds of lines of code. Not so
hard to modify, but the solution is not /elegant/. Being able to flag
data coming from MySQL as UTF8 to Perl would be the /elegant/ solution,
as you said. Maybe we should try harder to get this feature from the
DBD::mysql developers.
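
One way to limit that impact, sketched here as an assumption rather than anything Koha does, would be to decode in a single helper wrapped around the fetch instead of at every call site. `fetch_decoded_row` is a hypothetical name; it relies only on the statement handle offering DBI's `fetchrow_hashref` method.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Fetch the next row as a hashref and decode every defined value from
# UTF-8 bytes to a Perl character string. Returns nothing when the
# statement handle has no more rows.
sub fetch_decoded_row {
    my ($sth) = @_;
    my $row = $sth->fetchrow_hashref or return;
    for my $value (values %$row) {
        # values() aliases the hash elements, so the row is updated in place
        $value = decode('UTF-8', $value) if defined $value;
    }
    return $row;
}
```

Callers would then replace `$sth->fetchrow_hashref` with `fetch_decoded_row($sth)`. Columns holding binary data would need to be excluded, since decoding them as UTF-8 would corrupt them.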

Thanks for the clarification.

Bye

--
Pierrick LE GALL
INEO media system

