Re: Unimarc, marc21, Unicode, and MARC::File::XML


Re: Unimarc, marc21, Unicode, and MARC::File::XML

Paul POULAIN-2
Mike Rylander wrote:
> I tested with the record you sent Ed and me, and everything seems to
> work for me ...
> As you can see, I tested several variants of the UNIMARC flag, and
> even tested not sending the encoding to new_from_xml() ... it all
> seems to work for me, and I'm not sure what problems you're seeing.
> Perhaps you just needed to set your binmode for the XML source?

strange, strange...

Here is what my script does:
* retrieves the MARC::Record from Zebra
* reads some data from MySQL
* builds a page with HTML::Template
* sends the page to the browser

I added 3 lines to save the record to a file after reading it from Zebra.
Adding binmode(F,':utf8');
before saving my record to F gives me correct UTF-8.
Without binmode, it's not OK.
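
A minimal sketch of the file-saving step described above, assuming the record text is already a Perl character string. The sample string and the temporary file are placeholders, not the actual Koha code. With the ':utf8' layer set, each character is written as its UTF-8 byte sequence; without it, Perl writes characters below U+0100 as single Latin-1 bytes and warns about wider ones.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Placeholder for text taken from a MARC::Record: contains U+00E9 and U+015F.
my $title = "biblioth\x{e9}conomie \x{15f}";

my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh, ':utf8');              # encode characters as UTF-8 on output
print {$fh} $title;
close($fh) or die "close failed: $!";

# Read the raw bytes back to see what actually landed on disk.
open(my $in, '<:raw', $path) or die "open failed: $!";
my $bytes = do { local $/; <$in> };
close($in);

# Two of the 18 characters need two bytes each in UTF-8.
printf "%d characters became %d bytes\n", length($title), length($bytes);
# prints "18 characters became 20 bytes"
```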

But when I put the MARC::Record in a page built with HTML::Template,
it's wrong.
The HTML is UTF-8 (HTML page encoding).
It also contains some strings from MySQL, and all strings from MySQL
appear as correct UTF-8, while all strings from the MARC::Record
coming from Zebra do not!

I can add "binmode()" to the template output, but then everything goes wrong
with the strings from MySQL.

Any suggestions welcome!
--
Paul POULAIN et Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)


_______________________________________________
Koha-zebra mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/koha-zebra

Re: Unimarc, marc21, Unicode, and MARC::File::XML

Mike Rylander
On 3/20/06, Paul POULAIN <[hidden email]> wrote:

> Mike Rylander wrote:
> > I tested with the record you sent Ed and me, and everything seems to
> > work for me ...
> > As you can see, I tested several variants of the UNIMARC flag, and
> > even tested not sending the encoding to new_from_xml() ... it all
> > seems to work for me, and I'm not sure what problems you're seeing.
> > Perhaps you just needed to set your binmode for the XML source?
>
> strange, strange...
>
> What does my script :
> * retrieve the MARC::Record from zebra
> * read some datas from mysql
> * build a page with HTML::Template
> * send the pages to the browser

Are you getting XML or binary MARC from zebra?

>
> I added 3 lines to save the record in a file after reading from zebra.
> Adding binmode(F,':utf8');
> before saving my record in F, give me correct UTF-8.
> without binmode, it's NOK.
>
> But when I put the MARC::record in a page builded with HTML::Template,
> it's wrong.
> The HTML is utf-8 (html page encoding).
> It also contains some strings from mySQL and all strings from mySQL
> appear as correct utf8 while all strings coming from the MARC::record
> coming from zebra are not !
>
> I can add "binmode()" to the template output, but everything goes wrong
> with strings from mySQL.
>

Are you using decode_utf8($mysql_string) to let Perl know that the
database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
about that, and the DBD::mysql maintainers haven't added that
functionality to the module yet.

> Any suggestion welcomed !
> --
> Paul POULAIN et Henri Damien LAURENT
> Consultants indépendants
> en logiciels libres et bibliothéconomie (http://www.koha-fr.org)
>


--
Mike Rylander
[hidden email]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Pierrick LE GALL
Hello Mike,

I'll answer the second question, since I worked with Paul on
Perl/MySQL and UTF-8...

On Mon, 20 Mar 2006 09:59:32 -0500
"Mike Rylander" <[hidden email]> wrote:

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::MySQL maintainer haven't added that
> functionality to the module yet.

We don't use decode_utf8. Right after creating the database handle, we
force the connection to use UTF-8 with a "set names 'UTF8'" SQL query. Since
we know our data is stored as UTF-8 and we want UTF-8 out, everything works fine.

Bye

--
Pierrick LE GALL
INEO media system



Re: Unimarc, marc21, Unicode, and MARC::File::XML

Paul POULAIN-2
In reply to this post by Mike Rylander
Mike Rylander wrote:

> On 3/20/06, Paul POULAIN <[hidden email]> wrote:
>
>>Mike Rylander wrote:
>>
>>>I tested with the record you sent Ed and me, and everything seems to
>>>work for me ...
>>>As you can see, I tested several variants of the UNIMARC flag, and
>>>even tested not sending the encoding to new_from_xml() ... it all
>>>seems to work for me, and I'm not sure what problems you're seeing.
>>>Perhaps you just needed to set your binmode for the XML source?
>>
>>strange, strange...
>>
>>What does my script :
>>* retrieve the MARC::Record from zebra
>>* read some datas from mysql
>>* build a page with HTML::Template
>>* send the pages to the browser
> Are you getting XML or binary MARC from zebra?

XML. The test.xml I sent you on Friday was the record returned by
             $raw = $rs->record(0)->raw();

> Are you using decode_utf8($mysql_string) to let Perl know that the
> database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> about that, and the DBD::MySQL maintainer haven't added that
> functionality to the module yet.

I thought we had to decode_utf8($mysql_string), and began to investigate
a lot. But after many hours of digging and hitting problems, I now have
MySQL working in UTF-8 for all of Koha,
without any binmode or decode_utf8 ...
And it seems Joshua & Tümer (Turkey) have reached the same conclusion: no more
problems with MySQL & Perl.
We all use a recent version of MySQL, even though the DBD::mysql maintainer
(from mysql.com; Joshua sent him an email but got no answer) has
done nothing on the CPAN package.

--
Paul POULAIN et Henri Damien LAURENT
Consultants indépendants
en logiciels libres et bibliothéconomie (http://www.koha-fr.org)



Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Mike Rylander
In reply to this post by Pierrick LE GALL
On 3/20/06, Pierrick LE GALL <[hidden email]> wrote:

> Hello Mike,
>
> I'll answer to the second question, since I worked with Paul on
> Perl/MySQL and UTF-8...
>
> On Mon, 20 Mar 2006 09:59:32 -0500
> "Mike Rylander" <[hidden email]> wrote:
>
> > Are you using decode_utf8($mysql_string) to let Perl know that the
> > database is UTF8 encoded?  IIRC, MySQL doesn't know how to tell Perl
> > about that, and the DBD::MySQL maintainer haven't added that
> > functionality to the module yet.
>
> We don't use decode_utf8. Just after the database handler creation, we
> force communication to be UTF-8 with "set names 'UTF8'" SQL query. As
> we know our data are UTF-8 stored and we want UTF-8, all works fine.
>

Except that Perl doesn't know that the data is already UTF8 ... which
is the problem.  Perl /does/ know that the MARC data is UTF8, and it
has to convert one string or the other on output.  If you explicitly
use binmode() to set the PerlIO state to utf8, then the MARC::Record
strings, which are known good UTF8, are not transformed, but the MySQL
data, of which Perl has no encoding notions, gets "transformed", and
thus broken.
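
The effect described above can be reproduced in a few lines. This is only a sketch: the two strings are stand-ins for a MySQL value and a MARC::Record value, and encode() stands in for what a ':utf8' output layer does to each string on its way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as an undecoded database value might arrive: the UTF-8 encoding
# of U+00E9 (0xC3 0xA9). Perl sees two separate characters here.
my $from_db = "\xC3\xA9";

# A character string, as a UTF-8-aware module would return it.
my $from_marc = decode('UTF-8', "\xC3\xA9");    # one character, U+00E9

# Simulate the ':utf8' output layer on both strings.
my $marc_out = encode('UTF-8', $from_marc);
my $db_out   = encode('UTF-8', $from_db);

printf "MARC string:  %s\n", unpack('H*', $marc_out);   # c3a9
printf "MySQL string: %s\n", unpack('H*', $db_out);     # c383c2a9
```

The second result is the double-encoded form that shows up as garbage in the browser.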

The only consistent and correct way to deal with UTF8 data in perl is
to let PerlIO handle it by marking all sources as either providing
UTF8 data or not.  You can do that with binmode(), open() and several
other ways, including this in modern Perls (
http://search.cpan.org/~nwclark/perl-5.8.8/lib/open.pm ).  Because
DBD::mysql doesn't give you a way to mark its socket as UTF8, you need
to be a little underhanded and tell Perl as soon as possible using
decode(), or by making utf8 the default mode for all PerlIO channels.
There really isn't any way around this if you want to claim real UTF8
support and be able to use components that really do support UTF8
natively, like MARC::File::XML and MARC::Record.
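
A sketch of the "tell Perl as soon as possible" approach, under the assumption that rows come back as array references of raw UTF-8 bytes. The helper name decode_row is hypothetical, and the literal row below stands in for whatever the database driver returns.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode every defined column of one fetched row.
sub decode_row {
    my ($row) = @_;
    return [ map { defined $_ ? decode('UTF-8', $_) : undef } @$row ];
}

# Stand-in for a fetched row: raw UTF-8 bytes, one NULL column.
my $raw_row = [ "T\xC3\xBCmer", undef, "Koha" ];
my $row     = decode_row($raw_row);

printf "%d bytes in, %d characters out\n",
    length($raw_row->[0]), length($row->[0]);
# prints "6 bytes in, 5 characters out"
```

Calling such a helper in one place, right where rows are fetched, keeps the decoding out of the rest of the code.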

It's unfortunate that the DBD::mysql people won't fix their module,
but there really is a right way to do this, even without their help.
Is there a performance penalty with decode()?  Yep.  Would that go
away with a fix to the DBD::mysql module?  Mostly, so you really need
to bug them.

> Bye
>
> --
> Pierrick LE GALL
> INEO media system
>


--
Mike Rylander
[hidden email]
GPLS -- PINES Development
Database Developer
http://open-ils.org



Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Pierrick LE GALL
On Mon, 20 Mar 2006 10:54:08 -0500
"Mike Rylander" <[hidden email]> wrote:

> Except that Perl doesn't know that the data is already UTF8 ... which
> is the problem. [...]

You're completely right, I understand the difference. We made UTF8 work
from MySQL, but we didn't try to process the data coming from MySQL. Just
"select ..." and "print". So it works, but we are limited in string
processing.

> It's unfortunate that the DBD::mysql people won't fix their module,
> but there really is a right way to do this, even without their help.
> Is there a performance penalty with decode()?  Yep.  Would that go
> away with a fix to the DBD::mysql module?  Mostly, so you really need
> to bug them.

The problem with decode() is the impact. Adding this step to each
string retrieved from MySQL means changing hundreds of lines of code. It is not
hard to do, but the solution is not /elegant/. Being able to flag
data coming from MySQL as UTF8 for Perl would be the /elegant/ solution,
as you said. Maybe we should try harder to get this feature from the
DBD::mysql developers.

Thanks for the clarification.

Bye

--
Pierrick LE GALL
INEO media system



Re: Unimarc, marc21, Unicode, and MARC::File::XML

Tümer Garip
In reply to this post by Paul POULAIN-2
Hi,

This problem, if I understood it correctly, has nothing to do with
MySQL or Perl. It has to do with Zebra, unless it has to do with UNIMARC,
which I am not familiar with.
As you know (Paul), I have a UTF-8 version working.

I had the same problem with records coming from Zebra and found out that
it does not do what it is supposed to do unless you explicitly set it
to UTF-8. You have to explicitly put "encoding utf-8" in all your Zebra
config files, especially zebra.cfg and your .abs file. Otherwise, although
the documentation says that Zebra's character encoding is set automatically
from the XML encoding, it is NOT.

Perl sends XML to Zebra with encoding utf-8 in the header and UTF-8 data
in it. Zebra saves all the data in UTF-8 but returns XML that says
encoding iso8859-1 in the header and has UTF-8 characters in the data. No module
can correct this, as it is stupid.

I corrected the problem by adding encoding:UTF-8 to zebra.cfg,
record.abs and sort-string.chr.
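
For reference, the change might look like the fragment below. This is a sketch, not a tested configuration: the three file names are the ones mentioned above, and the exact directive syntax (with a colon in zebra.cfg, without one in .abs and .chr files) is an assumption to check against the documentation for your Zebra version.

```
# zebra.cfg
encoding: utf-8

# record.abs
encoding utf-8

# sort-string.chr
encoding utf-8
```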

Hope it solves yours,

Tumer




Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Adam Dickmeiss
Tümer Garip wrote:

> Hi,
>
> This problem if I understood it correctly has got nothing to do with
> mysql or perl it has to do with ZEBRA unless it is to do with UNIMARC
> which I am not familiar with.
> As you know (Paul) I have an utf-8 version working.
>
> I had the same problem from records coming from zebra and found out that
> it is not doing what it is supposed to do unless you explicitly set it
> to utf-8. You have to explicitly put "encoding utf-8" in all your zebra
> config files especially the zebra.cfg and your .abs . Otherwise unlike
> the documentation saying that zebra character code is automatically set
> by the xml encoding it DOES NOT.
I can't reproduce this (bug). Care to share a config+example that
illustrates this (inserting an XML record from Perl in UTF-8)?

> Perl send xml to zebra with encoding utf-8 on the header and utf-8 data
> in it. Zebra saves all the data in utf-8 but returns an xml saying
> encoding iso8859-1 at the header and utf-8 characters in data. No module
> can correct this as it is stupid.
Just need to know when the stupidity starts:-)

/ Adam





RE: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Tümer Garip
Hi Adam,
You seem a bit offended. That was not my intention; frustration
sometimes makes me use harsh words, and translating them into English
may make them too harsh.

I do not need to send you any config+examples because I tested this with
your default config files. I am attaching an XML record in UTF-8.

Briefly, I had the default configuration files and built Zebra with XML
records. When I noticed the problem,
I used yaz-client to see what was going on. In my log I could see that the data
going into Zebra was declared with encoding utf-8,
while yaz-client was returning XML with a header saying iso-8859-1, even though
I could actually see the UTF-8 characters, as they show up as hex in
yaz-client.

I have retried this procedure just now and the result is the same. Just by
adding encoding:UTF-8 to zebra.cfg and restarting the server, you get the
correct header and correct data. Please note that the server has to be
restarted, but the Zebra database does not have to be rebuilt.

Thanks
Tumer



Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Adam Dickmeiss
Tümer Garip wrote:
> Hi Adam,
> You seem a bit offended that was not my intention, just frustation
> sometimes
> makes me use harsh words and translanting them to english may be too
> harsh.
>
> I do not need to send you any config+examples cause I tested this with
> your default config files. I am attaching an xml record in utf-8
If you're to receive help from me, you need to tell me which zebra.cfg
you're using. And show me the record + the way you indexed it (zebraidx
update?)
>
> Briefly I had default configuration files and build zebra with xml
> records. When I noticed the problem
> I used yaz-client to see what was going on. On my log I could see data
> going in the zebra was with encoding utf-8
> While yaz client was returning xml with headers saying iso-8859-1 while
> I could actually see the utf-8 characters as they show as hex in yaz
> client.
I also need to know what you see, and what you'd expect to see.

/ Adam





RE: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Tümer Garip
I thought I explained it, but here it is again:

I do not think the method you use is relevant here, but just try
this:

In the test/usmarc folder of the Zebra release version, change zebra.cfg to
read
recordType: grs.xml
In the tabs folder, rename marc21.abs to record.abs.
Use zebraidx to create the database with the single XML record I sent
you.
Start zebrasrv on the required port.
Use yaz-client:
f @attr 1=1016 book
format xml
show

I see the XML record header saying
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

Further down you'll see the UTF-8 characters with the correct hex values, such as
\XC5\X9F

Now stop the server.
Add the line encoding:utf-8 to your zebra.cfg.
Restart the server.
Do the same search and you get
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>

Conclusion:
The database does keep the data in UTF-8, as expected.
The server does not know about the database character set or about the XML record that
was parsed in, and unless it is specifically set to UTF-8 in zebra.cfg, the server
goes ahead and changes the header, or in fact produces its own header
saying iso-8859-1, while giving out UTF-8 characters.
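
To see why the header matters, here is a small sketch that decodes the two bytes reported above under each of the two encodings. Only the byte values come from this thread; everything else is illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xC5\x9F";    # the two bytes shown by yaz-client

my $as_utf8   = decode('UTF-8',      $bytes);
my $as_latin1 = decode('ISO-8859-1', $bytes);

printf "UTF-8:      %d character(s), first is U+%04X\n",
    length($as_utf8), ord($as_utf8);
# prints "UTF-8:      1 character(s), first is U+015F"
printf "ISO-8859-1: %d character(s), first is U+%04X\n",
    length($as_latin1), ord($as_latin1);
# prints "ISO-8859-1: 2 character(s), first is U+00C5"
```

A parser that trusts the ISO-8859-1 header therefore turns one letter into two unrelated characters.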

I did not ask for any help on this, thanks. I am just clearing up some issues with
Paul's problem.
Tumer




Re: Re: Unimarc, marc21, Unicode, and MARC::File::XML

Adam Dickmeiss
Tümer Garip wrote:

> I thought I explained it but here it is again:
>
> I do not think which method you use is relevant here but but just try
> this:
>
> In the release version ZEBRA test/usmarc folder change the zebra.cfg to
> read
> recordType: grs.xml
> in the tabs folder change marc21.abs to read record.abs
> Use zebraidx to create the database with the single XML record I sent to
> you.
> Start the zebrasrv at the required port.
> Use yaz-client
> f @attr 1=1016 book
> format xml
> show
>
> I see the xml record header saying
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
>
> Further down you'll see utf-8 characters of correct hex as
> \XC5\X9F
>
> Now stop  the server.
> Add line encoding:utf-8 to your zebra.cfg
> Restart the server
> Do the same search you get
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
>
> Conclusion:
> The database does keep the data in UTF-8 as expected.
> Server does not know about database character set or the xml record taht
> was parsed in and unless specificly set to UTF-8 in Zebra.cfg srever
> goes ahead and changes the header or in fact it produces itself a header
> saying iso-8859-1 while giving out utf-8 characters.

Correct. I was unable to reproduce this fault because my XML test
record could be represented in ISO-8859-1. Your sample cannot.

Conversion from UTF-8 to ISO-8859-1 fails in Zebra. In this case,
Zebra keeps the data as is, but unfortunately alters the header anyway.
That's the mistake. Better behavior would probably be for Zebra not to
return the data at all, but to return a surrogate diagnostic for the record.

As you say, Zebra can be forced to use UTF-8 in the retrieval phase through the
configuration. You can also specify UTF-8 via the Z39.50 protocol
(charset utf-8 in yaz-client), and you should be able to achieve the
same with ZOOM-Perl.
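
A yaz-client session using that command might look like the sketch below. The host and port are placeholders, and putting charset before open is an assumption based on the character set being negotiated when the connection is initialized; the search is the one from the earlier test.

```
Z> charset utf-8
Z> open localhost:9999
Z> f @attr 1=1016 book
Z> format xml
Z> show
```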

For Zebra 1.3 we kept Latin-1 as the default character set because a
number of installations use it. For Zebra 1.4 the default is UTF-8, so
there should not be a problem with that in this case.

/ Adam





Re: Unicode, XML,Zebra,Windows

Tümer Garip
Hi Adam,
Well, I am pleased you managed to reproduce this bug.
Here are a few things to consider going towards 1.4.

The Zebra documentation says:
"Generally, the files are simple ASCII files, which can be maintained
using any text editor. "
And it also says:
"encoding encodingname
This directive specifies character encoding for external records. For
records such as XML that specifies encoding within the file via a header
this directive is ignored. If neither this directive is given, nor an
encoding is set within external records, ISO-8859-1 encoding is assumed.
"

In fact this is not the case. As you have seen, the XML file I sent
you had a header saying UTF-8 and also had UTF-8 characters in it. In
the Windows environment that was a genuine UTF-8 document. Zebra not only
failed to detect that but also ignored the header saying UTF-8.

We have a similar problem with .chr files as well. In the Windows
environment, Notepad is the simplest text editor. Write a sort.chr file
with Notepad with some UTF-8 characters in it, put encoding utf-8 in it, and
zebraidx gives a syntax error. To overcome that I had to produce a
sort.chr file on a colleague's Unix machine and use that. When I look at that
file in Notepad it is unreadable, so it is very difficult to maintain.

I think what is happening with this UTF-8 thing is that Windows and Unix
use different conventions for marking a file as Unicode. Since you
provide a binary for Windows as well, I think Zebra should stop checking
characters according to Unix conventions and rely on things like XML
headers or the encoding directives of configuration files (which, by the
way, the documentation says it does, but as you have seen it does not).
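
To make the header check concrete, here is a minimal Python sketch of relying on the XML declaration and flagging a record whose header disagrees with its bytes. It is a heuristic under stated assumptions, not how Zebra works internally: the function names are hypothetical, and "mislabelled" here only means "declared as something other than UTF-8 while the non-ASCII bytes form valid UTF-8".

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the start of the data.
_XML_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([^"\']+)["\']')
_UTF8_BOM = b"\xef\xbb\xbf"


def declared_encoding(xml_bytes: bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    if xml_bytes.startswith(_UTF8_BOM):
        xml_bytes = xml_bytes[len(_UTF8_BOM):]
    match = _XML_DECL.match(xml_bytes)
    return match.group(1).decode("ascii") if match else None


def looks_mislabelled(xml_bytes: bytes) -> bool:
    """True if the header does not say UTF-8 but the body is non-ASCII, valid UTF-8."""
    declared = declared_encoding(xml_bytes)
    if declared is None or declared.lower() in ("utf-8", "utf8"):
        return False
    if all(byte < 0x80 for byte in xml_bytes):
        return False  # pure ASCII is valid under either label
    try:
        xml_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A record returned with an ISO-8859-1 header but the bytes C5 9F in its body would be flagged by looks_mislabelled, which matches the symptom described in this thread.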


I should also say that I am testing 1.4 and it is far more efficient in
terms of CPU and memory usage, but this problem remains.
Well, we can't have it all, or can we?

Regards,
Tumer

-----Original Message-----
From: Adam Dickmeiss [mailto:[hidden email]]
Sent: Tuesday, March 21, 2006 11:55 PM
To: Tümer Garip
Cc: [hidden email]
Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
MARC::File::XML


Tümer Garip wrote:

> I thought I explained it but here it is again:
>
> I do not think which method you use is relevant here, but just try
> this:
>
> In the release version ZEBRA test/usmarc folder change the zebra.cfg
> to read
> recordType: grs.xml
> in the tabs folder change marc21.abs to read record.abs
> Use zebraidx to create the database with the single XML record I sent
> to you.
> Start the zebrasrv at the required port.
> Use yaz-client
> f @attr 1=1016 book
> format xml
> show
>
> I see the xml record header saying
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
>
> Further down you'll see utf-8 characters of correct hex as \XC5\X9F
>
> Now stop the server.
> Add the line encoding:utf-8 to your zebra.cfg
> Restart the server
> Do the same search and you get
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
>
> Conclusion:
> The database does keep the data in UTF-8 as expected.
> The server does not know about the database character set or the XML
> record that was parsed in, and unless it is specifically set to UTF-8 in
> zebra.cfg the server goes ahead and changes the header, or in fact
> produces a header itself saying iso-8859-1 while giving out utf-8
> characters.

Correct. I was unable to reproduce this fault because my XML test
record was able to be represented in UNICODE/UTF-8. Your sample is NOT.

Conversion from UTF-8 to ISO-8859-1 fails in Zebra. In this case,
Zebra keeps the data as is, but unfortunately alters the header anyway.
That's the mistake. Better behavior would probably be for Zebra to not
return the data at all, but return a surrogate diagnostic for the
record.
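
The failing conversion can be illustrated outside Zebra. The bytes C5 9F quoted earlier in this thread are the UTF-8 encoding of U+015F (ş), which has no ISO-8859-1 equivalent. The Python sketch below only demonstrates that fact; the function name is hypothetical and it says nothing about how Zebra performs the conversion.

```python
def to_latin1_or_none(utf8_bytes: bytes):
    """Convert UTF-8 bytes to ISO-8859-1, or return None if a character does not fit."""
    text = utf8_bytes.decode("utf-8")
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Returning None stands in for "report a diagnostic instead of
        # relabelling data that could not be converted".
        return None
```

With this function, the bytes C3 A9 (é) convert to the single byte E9, while C5 9F (ş) yields None.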

As you say, Zebra can be forced to use utf-8 in retrieval phase in the
configuration. You can also specify utf-8 via the Z39.50 protocol ..
(charset utf-8 in yaz-client).. and you should be able to achieve the
same with ZOOM-Perl.

For Zebra 1.3 we kept Latin-1 as the default character set because a
number of installations use it. For Zebra 1.4 the default is UTF-8, so
there should not be a problem with that in this case.

/ Adam

>
> I did not ask for any help on this, thanks. I am just clearing up some
> issues with Paul's problem.
> Tumer
> -----Original Message-----
> From: Adam Dickmeiss [mailto:[hidden email]]
> Sent: Tuesday, March 21, 2006 9:58 PM
> To: Tümer Garip
> Cc: [hidden email]
> Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
> MARC::File::XML
>
>
> Tümer Garip wrote:
>
>>Hi Adam,
>>You seem a bit offended. That was not my intention; frustration
>>sometimes makes me use harsh words, and translating them to English
>>may make them too harsh.
>>
>>I do not need to send you any config+examples because I tested this with
>>your default config files. I am attaching an XML record in UTF-8.
>
> If you're to receive help from me, you need to tell me which zebra.cfg
> you're using, and show me the record and the way you indexed it
> (zebraidx update?).
>
>>Briefly, I had the default configuration files and built the Zebra
>>database from XML records. When I noticed the problem I used yaz-client
>>to see what was going on. In my log I could see that the data going into
>>Zebra had encoding utf-8, while yaz-client was returning XML with headers
>>saying iso-8859-1, even though I could actually see the utf-8 characters,
>>as they show as hex in yaz-client.
>
> I also need to know what you see, and what you'd expect to see.
>
> / Adam
>
>
>>I have retried this procedure just now and the result is the same. Just
>>by adding encoding:UTF-8 to zebra.cfg and restarting the server you get
>>the correct header and correct data. Please note that the server has to
>>be restarted but the Zebra database does not have to be rebuilt.
>>
>>Thanks
>>Tumer
>>
>>-----Original Message-----
>>From: Adam Dickmeiss [mailto:[hidden email]]
>>Sent: Tuesday, March 21, 2006 9:00 PM
>>To: Tümer Garip
>>Cc: [hidden email]; [hidden email]
>>Subject: Re: [Koha-zebra] Re: Unimarc, marc21, Unicode, and
>>MARC::File::XML
>>
>>
>>Tümer Garip wrote:
>>
>>
>>>Hi,
>>>
>>>This problem, if I understood it correctly, has nothing to do with
>>>MySQL or Perl; it has to do with Zebra, unless it is to do with UNIMARC,
>>>which I am not familiar with. As you know (Paul), I have a UTF-8
>>>version working.
>>>
>>>I had the same problem with records coming from Zebra and found out
>>>that it does not do what it is supposed to do unless you explicitly
>>>set it to utf-8. You have to explicitly put "encoding utf-8" in all
>>>your Zebra config files, especially zebra.cfg and your .abs file.
>>>Contrary to the documentation, which says that Zebra's character
>>>encoding is set automatically from the XML encoding, it is NOT.
>>
>>I can't reproduce this (bug). Care to share a config+example that
>>illustrates this (inserts an XML record from Perl in UTF-8)?
>>
>>
>>
>>>Perl sends XML to Zebra with encoding utf-8 in the header and utf-8
>>>data in it. Zebra saves all the data in utf-8 but returns XML
>>>saying encoding iso8859-1 in the header with utf-8 characters in the data.
>>>No module can correct this as it is stupid.
>>
>>Just need to know when the stupidity starts:-)
>>
>>/ Adam
>>
>>
>>
>>>I corrected the problem by adding encoding:UTF-8 in zebra.cfg,
>>>record.abs, sort-string.chr
>>>
>>>Hope it solves yours,
>>>
>>>Tumer
>>>
>>>
>>>