Single character searches are slow when queryfuzzy and/or querystemming are enabled.

barton
I've got an issue that we've seen periodically at ByWater; I want to file a bug, but I don't have a clear idea of how to replicate the issue, so I'm trying to solicit information from the community on how to narrow the scope of the problem and/or measure the performance problems that I'm seeing.

The problem is that keyword searches (as you'd see from an un-modified masthead search on the opac, or the "search the catalog" search on the staff client) are markedly slow on some sites when

a) The search term contains single letter words like "a" in "Once Upon A Time" and
b) One or both of the QueryFuzzy or QueryStemming system preferences is enabled.

We first ran across this issue when we moved a number of our Koha instances to slower drives and found that some (but not all) would time out when searching for "Once upon a time", "A swiftly tilting planet", or "Hitchiker's Guide to the Galaxy" (where the 's' after the apostrophe is counted as a separate one-letter word).

I turned on 'request' logging in zebra; here's the query that timed out:

find @attrset Bib-1 @or @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "once upon a time" @attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "once upon a time" @attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon a time" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? " @attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"
Sent searchRequest.
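
For readers who find the one-line query hard to scan, here is a minimal Python sketch that splits it into its seven operands and lists the attributes on each. The attribute-type labels are the standard Bib-1 type names; type 9 is not one of those six, so the sketch leaves it unlabeled. The helper name `split_clauses` is made up for this illustration.

```python
import re

# The PQF from the log above, without the leading "find".
PQF = (
    '@attrset Bib-1 @or @or @or @or @or @or '
    '@attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "once upon a time" '
    '@attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "once upon a time" '
    '@attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "once upon a time" '
    '@attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "once upon a time" '
    '@attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon a time" '
    '@attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? " '
    '@attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"'
)

# Standard Bib-1 attribute types; anything else is printed as "type N".
TYPE_NAMES = {
    "1": "use", "2": "relation", "3": "position",
    "4": "structure", "5": "truncation", "6": "completeness",
}


def split_clauses(pqf):
    """Return one (attrs, term) pair per quoted operand in the query."""
    clauses = []
    # An operand is a run of '@attr type=value' items followed by a quoted term.
    pattern = r'((?:@attr\s+\d+=\d+\s+)+)"([^"]*)"'
    for attrs_text, term in re.findall(pattern, pqf):
        attrs = dict(re.findall(r'@attr\s+(\d+)=(\d+)', attrs_text))
        clauses.append((attrs, term))
    return clauses


clauses = split_clauses(PQF)
for number, (attrs, term) in enumerate(clauses, start=1):
    labelled = ", ".join(
        f"{TYPE_NAMES.get(t, 'type ' + t)}={v}" for t, v in attrs.items()
    )
    print(f"clause {number}: {term!r} [{labelled}]")
```

Only the fifth and sixth operands carry a truncation attribute (5=103 and 5=1), and the sixth is the only one whose term text was rewritten ("onc? upon? a time? ").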

I ran the PQF in zebra:

Z> base biblios
Z> find @attrset Bib-1 @or @or @or @or @or @or @attr 1=36 @attr 4=1 @attr 6=3 @attr 9=32 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=1 @attr 6=3 @attr 9=28 @attr 2=102 "once upon a time" @attr 1=36 @attr 4=1 @attr 9=26 @attr 2=102 "once upon a time" @attr 1=4 @attr 4=6 @attr 9=24 @attr 2=102 "once upon a time" @attr 4=6 @attr 5=103 @attr 9=16 @attr 2=102 "once upon a time" @attr 4=6 @attr 5=1 @attr 9=14 @attr 2=102 "onc? upon? a time? " @attr 4=6 @attr 9=14 @attr 2=102 "once upon a time"
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 119, setno 1
SearchResult-1: term=once upon a time cnt=3, term=once upon a time cnt=5, term=once cnt=94, term=upon cnt=63, term=a cnt=4876, term=time cnt=515, term=once cnt=168, term=upon cnt=125, term=a cnt=19801, term=time cnt=1507, term=once cnt=9830, term=upon cnt=960, term=a cnt=82237, term=time cnt=8524, term=onc cnt=1915, term=upon cnt=914, term=a cnt=82237, term=time cnt=7709, term=once cnt=1302, term=upon cnt=914, term=a cnt=68283, term=time cnt=5442
records returned: 0
Elapsed: 98.163583

The elapsed time is just under 100 seconds; I think Koha times out after 60.
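
To see how the per-term counts in that response are distributed, the `SearchResult-1` line can be tallied with a short Python sketch like the one below. The function names are invented for this example, and the counts are copied from the output above.

```python
import re

# The SearchResult-1 line from the session above.
SEARCH_RESULT = (
    "term=once upon a time cnt=3, term=once upon a time cnt=5, "
    "term=once cnt=94, term=upon cnt=63, term=a cnt=4876, term=time cnt=515, "
    "term=once cnt=168, term=upon cnt=125, term=a cnt=19801, term=time cnt=1507, "
    "term=once cnt=9830, term=upon cnt=960, term=a cnt=82237, term=time cnt=8524, "
    "term=onc cnt=1915, term=upon cnt=914, term=a cnt=82237, term=time cnt=7709, "
    "term=once cnt=1302, term=upon cnt=914, term=a cnt=68283, term=time cnt=5442"
)


def parse_term_counts(line):
    """Return the (term, count) pairs in the order the server reported them."""
    return [(term, int(cnt)) for term, cnt in re.findall(r"term=(.*?) cnt=(\d+)", line)]


def totals_by_term(pairs):
    """Sum the reported counts for each distinct term."""
    totals = {}
    for term, count in pairs:
        totals[term] = totals.get(term, 0) + count
    return totals


pairs = parse_term_counts(SEARCH_RESULT)
totals = totals_by_term(pairs)
grand_total = sum(totals.values())

# Print terms from largest to smallest total, with their share of all counts.
for term, count in sorted(totals.items(), key=lambda item: item[1], reverse=True):
    print(f"{term!r}: {count} ({count / grand_total:.1%})")
```

On these numbers the single term "a" accounts for 257434 of the 297424 counted hits, roughly 87%. The counts are only a proxy for the work the server did, so this does not by itself show where the 98 seconds went.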

We looked at the disk I/O and found that the disk where the searches were occurring was getting hammered, so we migrated the sites that were having issues onto a server with a much faster drive. This did, at least, solve the time-out issue.

The issue keeps on popping up, however -- if not in actual timeouts, then at least in complaints of slowness in search.

My hypothesis is that either QueryFuzzy or QueryStemming is expanding the one-letter words, i.e. a search for "a" is returning results for either all words that start with "a" or all words containing "a", and that all of these results are written to disk before any further filtering is done.

When we were seeing timeout issues, it wasn't clear to me which sites were likely to be affected: sites with a very small number of bibs didn't have issues, but the problem wasn't determined solely by collection size. So here's what I'd like to know, for the purposes of further testing and/or to be able to replicate the issue:

1) Is there anyone who knows all about PQF who can tell me why the query above would run so slowly?
2) Is there part of the PQF that seems to be behaving particularly badly? Which parts of the query are returning "term=a cnt=82237", and can we avoid having that called twice?
3) How do I go about constructing a data set that illustrates this problem?
4) What's the best way to measure the performance problems and/or define what the problem is when I file a bug?
5) Are there any work-arounds, given that certain sites really want QueryStemming and QueryFuzzy enabled?
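
On question 4, one possible starting point is a small timing harness that runs the same search several times and reports the spread. The sketch below is only that: `run_search` is a hypothetical callable that the tester would supply (for example, a function that sends the PQF above to the server and waits for the response), and nothing here is specific to Koha or zebra.

```python
import statistics
import time


def time_search(run_search, repeats=5):
    """Call run_search() `repeats` times; return the elapsed seconds of each call."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_search()  # hypothetical: issues one search and blocks until it returns
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the first run separately, since it may hit a cold disk cache."""
    return {
        "first": samples[0],
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in workload so the sketch runs on its own; replace with a real search.
    print(summarize(time_search(lambda: sum(range(100_000)), repeats=5)))
```

Since disk I/O was the visible bottleneck, comparing the same search with QueryFuzzy and QueryStemming on and off, and with and without the one-letter word, would give a bug report four numbers that isolate each factor.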

Thanks,

--Barton

_______________________________________________
Koha-devel mailing list
[hidden email]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-devel
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/