zebra config problem (still 0, yes, really 0 !)

Paul POULAIN-2
Hello list,

This time it seems Zebra works for both indexing and searching. The last
blocking problem was... a space in recordId: (bib1,Identifier-standard)
just after the comma. Adam agreed it was a bug, and it should be fixed
soon. But now that we are aware of it, we can avoid putting the space there!

I've committed everything needed to set up a working Zebra DB in UNIMARC
(in the misc/migration_tools and /zebra directories):

* collection.abs is UNIMARC-specific and must be rewritten for MARC21,
in the marc21 directory

* pdf.properties is to be copied unmodified into the marc21 directory (it
can also be put somewhere else)

* rebuild_zebra.pl is SLOW, but a one-step reindexing tool, using ZOOM

* rebuild_zebra_idx is FAST, but a two-step reindexing tool, and does not
use zebra itself. Run it and it will create XML files for all biblios in
the /zebra/biblios directory; then run zebraidx update biblios in your
zebra directory

* zebra.cfg is the zebra config file ;-)

* test_cql2rpn.pl is a script that queries the database and shows the
results. Works for me; just change the query at the beginning to get
the answers you expect.

What remains to be done:
* benchmarking: zebraidx update seems faster than lightning
(400 biblios/sec: 10 000 biblios in 25 seconds), while ZOOM indexing is
slow (something like 25 biblios/second). More benchmarking could be done.
* completing collection.abs for UNIMARC. I'll take care of it.
* modifying Biblio.pm to use ZOOM instead of the "zebraidx through exec"
currently in use. I'll take care of that too.
* modifying the search API, tools and screens. I'll leave this one to
someone else (chris?). I agree SearchMarc.pm can be dropped and
replaced by something else (maybe a new and clean Search.pm package)
--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)


_______________________________________________
Koha-zebra mailing list
[hidden email]
http://lists.nongnu.org/mailman/listinfo/koha-zebra
Re: zebra config problem (still 0, yes, really 0 !)

Mike Taylor-2
> Date: Thu, 09 Feb 2006 12:02:11 +0100
> From: Paul POULAIN <[hidden email]>
>
> This time it seems zebra work for both indexing and search.

Excellent news!

> What has to be done :
> * benchmarking : it seems the zebraidx update is faster than lightning
> (400biblios/sec : 10 000biblios in 25seconds), while ZOOM indexing is
> slow (something like 25biblios/second) More benchmarking could be done.

That is a surprising difference, since as you no doubt know, "ZOOM
indexing" is merely the use of ZOOM to pass the records to Zebra for
indexing.  I flatly refuse to believe that the communication layer is
responsible for a slow-down by a factor of 16, so something else is
going on here.  My best guess is that "zebraidx update" is making use
of caching mechanisms that ZOOM's update requests are not benefiting
from.  There may be a way to have ZOOM request that caching: Adam will
be able to tell us.
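
The factor of 16 follows directly from the two figures quoted above. A minimal Python check, using only the numbers from Paul's message (the variable names are illustrative, not taken from any Koha or Zebra tool):

```python
# Figures quoted in the thread; variable names are illustrative.
zebraidx_records = 10_000      # biblios indexed by "zebraidx update"
zebraidx_seconds = 25          # wall-clock time reported for that run
zoom_rate = 25                 # biblios/second reported for ZOOM indexing

zebraidx_rate = zebraidx_records / zebraidx_seconds   # biblios/second
slowdown = zebraidx_rate / zoom_rate

# At the ZOOM rate, the same 10 000 records would take this many seconds.
zoom_seconds = zebraidx_records / zoom_rate

print(f"zebraidx: {zebraidx_rate:.0f} biblios/s")
print(f"slow-down factor: {slowdown:.0f}")
print(f"ZOOM time for the same load: {zoom_seconds:.0f} s")
```

Both figures are first measurements rather than a controlled benchmark, so the ratio is only as reliable as they are.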

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <[hidden email]>  http://www.miketaylor.org.uk
)_v__/\  "Lisp is just glorified C with completely different brackets"
         -- Harvey "Max" Thompson.




Re: zebra config problem (still 0, yes, really 0 !)

Paul POULAIN-2
Mike Taylor wrote:

>>* benchmarking : it seems the zebraidx update is faster than lightning
>>(400biblios/sec : 10 000biblios in 25seconds), while ZOOM indexing is
>>slow (something like 25biblios/second) More benchmarking could be done.
> That is a surprising difference, since as you no doubt know, "ZOOM
> indexing" is merely the use of ZOOM to pass the records to Zebra for
> indexing.  I flatly refuse to believe that the communication layer is
> responsible for a slow-down by a factor of 16, so something else is
> going on here.  My best guess is that "zebraidx update" is making use
> of caching mechanisms that ZOOM's update requests are not benefiting
> from.  There may be a way to have ZOOM request that caching: Adam will
> be able to tell us.

Just a bet:
If I hear my SCSI disk correctly and read the logs right, it seems that
zebraidx update reads all the records and indexes them all at once,
while ZOOM indexes them one by one.
So you get a lot of unnecessary SCSI writes.
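
The difference between the two patterns can be sketched with a toy model. The `ToyIndex` class below is purely hypothetical and is not how Zebra works internally; it only counts how often a simulated flush to disk happens when records are committed one by one versus all at once.

```python
class ToyIndex:
    """Toy stand-in for an index that must be flushed to disk.

    This is NOT Zebra's implementation; it only counts flushes.
    """

    def __init__(self):
        self.pending = []      # records added but not yet flushed
        self.stored = []       # records that have reached "disk"
        self.flushes = 0       # number of simulated disk flushes

    def add(self, record):
        self.pending.append(record)

    def flush(self):
        # Only count a flush when there is something to write.
        if self.pending:
            self.stored.extend(self.pending)
            self.pending.clear()
            self.flushes += 1


def index_one_by_one(records):
    """Flush after every record (the pattern suspected for ZOOM updates)."""
    index = ToyIndex()
    for record in records:
        index.add(record)
        index.flush()
    return index


def index_in_batch(records):
    """Add everything, then flush once (the pattern suspected for zebraidx)."""
    index = ToyIndex()
    for record in records:
        index.add(record)
    index.flush()
    return index


records = [f"biblio-{n}" for n in range(10_000)]
one_by_one = index_one_by_one(records)
batch = index_in_batch(records)
print(one_by_one.flushes, batch.flushes)
```

Both runs end with the same stored records; only the number of flushes differs, which is the cost the bet is about.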

The main question here is that we haven't decided yet whether we will
store item status in the Zebra DB or just in SQL. (If we store status in
Zebra, then Z39.50 queries could be 100% complete.) With a 25/s indexing
speed, I think we could afford it. But I haven't done a real benchmark;
it's just a first measurement!

(Note that I won't play the lottery even if I win my bet, as EuroMillions
was won last week: 183 000 000 EUR! We're back to a small 15 000 000,
increasing by something like 15 every week until someone wins!)

--
Paul POULAIN and Henri Damien LAURENT
Independent consultants
in free software and library science (http://www.koha-fr.org)


Re: zebra config problem (still 0, yes, really 0 !)

Sebastian Hammer
In reply to this post by Mike Taylor-2
Mike Taylor wrote:

>> Date: Thu, 09 Feb 2006 12:02:11 +0100
>> From: Paul POULAIN <[hidden email]>
>>
>> This time it seems zebra work for both indexing and search.
>
> Excellent news!
>
>> What has to be done :
>> * benchmarking : it seems the zebraidx update is faster than lightning
>> (400biblios/sec : 10 000biblios in 25seconds), while ZOOM indexing is
>> slow (something like 25biblios/second) More benchmarking could be done.
>
> That is a surprising difference, since as you no doubt know, "ZOOM
> indexing" is merely the use of ZOOM to pass the records to Zebra for
> indexing.  I flatly refuse to believe that the communication layer is
> responsible for a slow-down by a factor of 16, so something else is
> going on here.  My best guess is that "zebraidx update" is making use
> of caching mechanisms that ZOOM's update requests are not benefiting
> from.  There may be a way to have ZOOM request that caching: Adam will
> be able to tell us.

I don't particularly refuse to believe the factor 16. Network overheads
are very expensive.

I think you're only transferring one record at a time, right? That's bad
for network latency in itself, but I don't know whether either the client
API or the server will let us do more at a time. It might be worth
considering, because adding several records at a time is generally a lot
faster (per record) than adding single records.

One thing to try, if you don't already do it, is to use a UNIX domain
socket if your server and client run on the same machine:

zebrasrv unix:s

will listen on a socket called 's'. Since Unix domain sockets
circumvent the network stack and work just like pipes, they can be tons
faster. I usually use this when using Zebra as an embedded engine. I'm
hoping Mike's ZOOM Perl wrapper won't prevent you from using this option
of the lower-layer tool.
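
For readers unfamiliar with the transport, here is a minimal Python sketch of a client and server talking over a Unix domain socket named 's'. It is only an illustration of the socket type: the echo server is a hypothetical stand-in, it does not speak Z39.50, and it says nothing about how zebrasrv or ZOOM handle the connection.

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock):
    """Accept one connection and echo back whatever it sends."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data)


def echo_over_unix_socket(message: bytes) -> bytes:
    """Send message through a Unix domain socket and return the echo."""
    with tempfile.TemporaryDirectory() as tmp:
        # The socket is a filesystem path, not a host and port.
        path = os.path.join(tmp, "s")
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server:
            server.bind(path)
            server.listen(1)
            worker = threading.Thread(target=serve_once, args=(server,))
            worker.start()
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
                client.connect(path)
                client.sendall(message)
                reply = client.recv(1024)
            worker.join()
    return reply


print(echo_over_unix_socket(b"ping"))
```

Because the address is a path on the local filesystem, this transport only works when client and server share a machine.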

--Seb


--
Sebastian Hammer, Index Data
[hidden email]   www.indexdata.com
Ph: (603) 209-6853



Re: zebra config problem (still 0, yes, really 0 !)

Mike Taylor-2
> Date: Thu, 09 Feb 2006 13:25:50 -0500
> From: Sebastian Hammer <[hidden email]>
>
>> I flatly refuse to believe that the communication layer is
>> responsible for a slow-down by a factor of 16, so something else is
>> going on here.  My best guess is that "zebraidx update" is making
>> use of caching mechanisms that ZOOM's update requests are not
>> benefiting from.  There may be a way to have ZOOM request that
>> caching: Adam will be able to tell us.
>
> I don't particularly refuse to believe the factor 16. Network
> overheads are very expensive.

C'mon, dude.  We're adding and indexing records here.  A typical MARC
record might contain say 30 indexable words, so naively you're doing
30 seeks and writes.  Suppose caching reduces that by a factor of
ten.  Still -- you're seeking and writing multiple times.  No
in-memory copy (which is all the TCP/IP socket write is) is going to
come close to using that much time.  As always we should benchmark
this rather than just spouting opinions, but I bet you 422 trillion
Canadian dollars that switching to a Unix-domain socket makes very
little difference indeed.

(If the client and server were on different machines, I would be less
disinclined to swallow your hypothesis.)
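
This estimate can be put into rough numbers. The sketch below takes the 30 words and the factor of ten from the paragraph above; the seek-plus-write cost, record size and memory copy rate are assumed placeholder figures that do not come from the thread, chosen only to show the shape of the argument.

```python
# Figures taken from the message above.
indexable_words = 30        # "say 30 indexable words" per MARC record
caching_factor = 10         # "caching reduces that by a factor of ten"

# ASSUMPTIONS (not from the thread): placeholder hardware figures.
seek_write_seconds = 0.008      # hypothetical cost of one seek + write
record_bytes = 1024             # hypothetical size of one record
copy_bytes_per_second = 1e9     # hypothetical in-memory copy rate

disk_ops = indexable_words / caching_factor
disk_seconds = disk_ops * seek_write_seconds
copy_seconds = record_bytes / copy_bytes_per_second

print(f"disk work per record: {disk_seconds * 1000:.1f} ms")
print(f"copy per record:      {copy_seconds * 1e6:.3f} microseconds")
print(f"disk/copy ratio:      {disk_seconds / copy_seconds:,.0f}")
```

With these assumed figures the disk work dominates the copy by several orders of magnitude, which is the point of the argument; a real benchmark would still be needed to settle it.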

> I think you're only transfering one record at a time, right? That's
> bad for network latency in itself, but I dunno if either the client
> API or the server will let us do more at a time.. it might be worth
> considering, because adding more records at a time is generally a
> lot faster (per record) than adding single records.

Yes, this would be a reasonable enhancement to make to our update API.

> One thing to try, if you don't already do it, is to use a UNIX domain
> socket if your server and client run on the same site..
>
> zebrasrv unix:s
>
> Will listen on a socket called 's'... since Unix Domain sockets
> circumvent the network stuff and work just like pipes, they can be
> tons faster... I usually use this when using Zebra as an embedded
> engine. I'm hoping Mike's ZOOM Perl wrapper won't prevent you from
> using this option of the lower-layer tool.

It won't!  Go right ahead, Unix-domain sockets will work fine.

 _/|_ ___________________________________________________________________
/o ) \/  Mike Taylor  <[hidden email]>  http://www.miketaylor.org.uk
)_v__/\  "I never make predictions and I never will" -- Paul Gascoigne.


