[Bug 22258] New: Elasticsearch - full record is not indexed in plain text

bugzilla-daemon
https://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=22258

            Bug ID: 22258
           Summary: Elasticsearch - full record is not indexed in plain
                    text
 Change sponsored?: ---
           Product: Koha
           Version: master
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P5 - low
         Component: Searching - Elasticsearch
          Assignee: [hidden email]
          Reporter: [hidden email]
  Target Milestone: ---

This is to start a discussion: at some point we dropped the inclusion of the
full record in plain text in the ES index.

A consequence is that data in a field that is not specifically indexed is no
longer searchable.

I believe this is a change from Zebra, where the entire record can be searched.

I am not sure of the effect on relevancy, but I think we need to find a way to
support full record searching.

--
You are receiving this mail because:
You are the assignee for the bug.
You are watching all bug changes.
_______________________________________________
Koha-bugs mailing list
[hidden email]
http://lists.koha-community.org/cgi-bin/mailman/listinfo/koha-bugs
website : http://www.koha-community.org/
git : http://git.koha-community.org/
bugs : http://bugs.koha-community.org/

Nick Clemens <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

Nick Clemens <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email],
                   |                            |[hidden email]-community.org,
                   |                            |[hidden email],
                   |                            |[hidden email],
                   |                            |martin.renvoize@ptfs-europe.com,
                   |                            |[hidden email],
                   |                            |[hidden email],
                   |                            |[hidden email]

--- Comment #1 from Ere Maijala <[hidden email]> ---
I'd suggest something like allowing wildcards in the index mappings so that you
could index all the needed fields in the _all field using the index mappings.
This would make it possible to e.g. concatenate the subfields for proper phrase
searching. Additionally, there could be an option to index the MARC record so
that it's easily searchable by field. I wouldn't make it the default since it
can be sort of a really big hammer used on a really small nail, but having it
as an option would give nice flexibility.
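
As a rough illustration of the wildcard idea (a sketch only, not Koha code): the
record layout, the `6??` pattern syntax and the `collect_all_text` helper below
are all hypothetical. It selects every field whose tag matches a wildcard rule
and joins its subfields in order, so that a phrase spanning several subfields
stays adjacent in the catch-all text.

```python
from fnmatch import fnmatchcase

# Hypothetical record layout: a list of (tag, [(subfield_code, value), ...]).
record = [
    ("245", [("a", "Diary of a young girl"), ("c", "Anne Frank")]),
    ("650", [("a", "Holocaust"), ("x", "Personal narratives")]),
    ("655", [("a", "Diaries")]),
]


def collect_all_text(record, tag_patterns):
    """Return one concatenated string per field whose tag matches a pattern."""
    phrases = []
    for tag, subfields in record:
        if any(fnmatchcase(tag, pattern) for pattern in tag_patterns):
            # Join subfields in order so cross-subfield phrases stay adjacent.
            phrases.append(" ".join(value for _code, value in subfields))
    return phrases


print(collect_all_text(record, ["6??"]))
# ['Holocaust Personal narratives', 'Diaries']
```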

Ere Maijala <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|[hidden email]-community.org |[hidden email]

--- Comment #2 from Nicolas Legrand <[hidden email]> ---
Hey Nick,

thank you for working on ES!

I think we have better relevancy without having the whole record indexed. I
guess it also produces a smaller index and maybe faster indexing.

It looks redundant to me to do otherwise. It is like doing something super
precisely and very neatly, and in the same move throwing whatever you have into
a big bucket. Some people want it because they fear losing something along the
way. But this may make noise like a cluster of thousands of fuzz pedals with
all knobs all the way up (OK, maybe I'm being a bit dramatic here :)).

In the UNIMARC 4XX fields, for example, we have a lot of references. If the
record is for an article, you'll get the name of the serial in some 4XX
fields. So typing the name of a serial may bring up all its minions, and they
will bury it on the tenth result page or something like that. So we have to be
sure how to sort it. OK, with weighting on the title index, the serial ends up
at the top of the list, so maybe it is not that bad.

I'm not eager for this one. If it has to be, I'll put a super low weight on the
index entry to test it first. I'll also check the speed and the index size.
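
The weighting idea can be sketched as an Elasticsearch `query_string` request
body with per-field boosts. This is only an illustration: the `title` field
name, the boost values and the `weighted_query` helper are assumptions, not
Koha's actual query builder.

```python
def weighted_query(user_query, title_boost=10.0, catch_all_boost=0.1):
    """Build a query body that favours title matches over catch-all matches."""
    return {
        "query": {
            "query_string": {
                "query": user_query,
                # "field^boost" is the query_string syntax for per-field weights.
                "fields": [f"title^{title_boost}", f"_all^{catch_all_boost}"],
            }
        }
    }


body = weighted_query("Le Monde diplomatique")
print(body["query"]["query_string"]["fields"])
```

With a low boost on the catch-all field, a serial whose title matches should
still outrank the articles that only mention it in a linking field.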

Cheers!

--- Comment #3 from Ere Maijala <[hidden email]> ---
Nicolas, don't worry, it will be completely optional. :)

--- Comment #4 from Nicolas Legrand <[hidden email]> ---
He he, I just fear the moment when my librarian colleagues will know about this
option :)

--- Comment #5 from Ere Maijala <[hidden email]> ---
Ah, yes :) Then again you could already add rules for all MARC fields, it's
just more cumbersome.

--- Comment #6 from Ere Maijala <[hidden email]> ---
Created attachment 85158
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=85158&action=edit
Bug 22258: Elasticsearch: Add array as an alternative MARC format

Adds preference ElasticsearchMARCFormat that controls whether MARC records are
stored as ISO2709/MARCXML or array. Array is searchable by field and also
indexes all subfields in the _all field for searching.

Test plan:
1. Test that searching and indexing works with the patch without any changes.
2. Switch to array format and index some records.
3. Check e.g. the 008 field of a record and verify that the record can be found
with the contents enclosed in quotes.
4. Check that it's possible to search for a specific field/subfield. Search
query: marc_data_array.fields.655.subfields.a:Diaries
5. Check that tests still pass, especially t/Koha/SearchEngine/Elasticsearch.t
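
To illustrate why a dotted path like the one in step 4 can address a single
subfield, here is a minimal sketch. It assumes the array format resembles
MARC-in-JSON; the sample document and the `flatten` helper are made up for
illustration and are not taken from the patch.

```python
def flatten(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs, descending through dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            # Arrays add no path segment, mirroring how object arrays are indexed.
            yield from flatten(child, prefix)
    else:
        yield prefix, value


# Hypothetical document in a MARC-in-JSON-like layout.
doc = {
    "marc_data_array": {
        "leader": "00000nam a2200000 a 4500",
        "fields": [
            {"001": "12345"},
            {"655": {"ind1": " ", "ind2": "7",
                     "subfields": [{"a": "Diaries"}, {"2": "lcgft"}]}},
        ],
    }
}

paths = dict(flatten(doc))
print(paths["marc_data_array.fields.655.subfields.a"])  # Diaries
```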

--- Comment #7 from Ere Maijala <[hidden email]> ---
So, the patch allows one to change the format in which MARC is stored in
Elasticsearch. Additionally, the new field type uses default search field
settings that cause the content to be indexed in _all for keyword searching.

I just realized the test plan doesn't quite work. Step 1 needs to be changed:

1. Test that searching and indexing works with the patch after recreating the
index.

Nick, do you think this makes sense and supports the needed use cases? I've yet
to implement the wildcard handling for mappings, and that can be left for
another issue.

Ere Maijala <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |Needs Signoff

--- Comment #8 from Nick Clemens <[hidden email]> ---
(In reply to Ere Maijala from comment #7)
Yes, this seems an easy way to ensure the whole record is searchable. I like
it as an optional format. Will test when I can.

Marjorie Barry-Vila <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |marjorie.barry-vila@collecto.ca

Michal Denar <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]

--- Comment #9 from Michal Denar <[hidden email]> ---
After applying the patch, MARC is searchable in the index and reindexing works,
but Koha search is broken.

Katrin Fischer <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |[hidden email]
             Status|Needs Signoff               |Failed QA

--- Comment #10 from Katrin Fischer <[hidden email]> ---
See comment #9.

Ere Maijala <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|Failed QA                   |Needs Signoff

Ere Maijala <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #85158|0                           |1
        is obsolete|                            |

--- Comment #11 from Ere Maijala <[hidden email]> ---
Created attachment 87436
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=87436&action=edit
Bug 22258: Elasticsearch: Add array as an alternative MARC format

Adds preference ElasticsearchMARCFormat that controls whether MARC records are
stored as ISO2709/MARCXML or array. Array is searchable by field and also
indexes all subfields in the _all field for searching.

Test plan:
1. Test that searching and indexing works with the patch without any changes.
2. Switch to array format and index some records.
3. Check e.g. the 008 field of a record and verify that the record can be found
with the contents enclosed in quotes.
4. Check that it's possible to search for a specific field/subfield. Search
query: marc_data_array.fields.655.subfields.a:Diaries
5. Check that tests still pass, especially t/Koha/SearchEngine/Elasticsearch.t

--- Comment #12 from Ere Maijala <[hidden email]> ---
Created attachment 87437
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=87437&action=edit
Bug 22258: Increase Elasticsearch maximum field count to 10000

Increases maximum field count from the default 1000 to 10000 to accommodate
large records and MARC as an array.

--- Comment #13 from Ere Maijala <[hidden email]> ---
Right. We have some new index fields, and together with the MARC record as an
array they blew past the Elasticsearch default limit of 1000 fields per index.
I added configuration to raise the limit to 10000. The limit is just a
safeguard in Elasticsearch against a broken indexer blowing up the index.
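
For reference, the Elasticsearch setting behind this limit is
`index.mapping.total_fields.limit`. A minimal sketch of an index-creation body
that raises it follows; the `index_settings` helper is hypothetical, and where
Koha actually places this setting in its own configuration is not shown here.

```python
import json


def index_settings(total_fields_limit=10000):
    """Return an index-creation body that raises the mapped-field limit."""
    return {
        "settings": {
            # Elasticsearch defaults this to 1000 mapped fields per index.
            "index.mapping.total_fields.limit": total_fields_limit,
        }
    }


print(json.dumps(index_settings(), indent=2))
```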

Michal Denar <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|Needs Signoff               |Signed Off

Michal Denar <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
  Attachment #87436|0                           |1
        is obsolete|                            |
  Attachment #87437|0                           |1
        is obsolete|                            |

--- Comment #14 from Michal Denar <[hidden email]> ---
Created attachment 87440
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=87440&action=edit
Bug 22258: Elasticsearch: Add array as an alternative MARC format

Adds preference ElasticsearchMARCFormat that controls whether MARC records are
stored as ISO2709/MARCXML or array. Array is searchable by field and also
indexes all subfields in the _all field for searching.

Test plan:
1. Test that searching and indexing works with the patch without any changes.
2. Switch to array format and index some records.
3. Check e.g. the 008 field of a record and verify that the record can be found
with the contents enclosed in quotes.
4. Check that it's possible to search for a specific field/subfield. Search
query: marc_data_array.fields.655.subfields.a:Diaries
5. Check that tests still pass, especially t/Koha/SearchEngine/Elasticsearch.t

Signed-off-by: Michal Denar <[hidden email]>

--- Comment #15 from Michal Denar <[hidden email]> ---
Created attachment 87441
  -->
https://bugs.koha-community.org/bugzilla3/attachment.cgi?id=87441&action=edit
Bug 22258: Increase Elasticsearch maximum field count to 10000

Increases maximum field count from the default 1000 to 10000 to accommodate
large records and MARC as an array.

Signed-off-by: Michal Denar <[hidden email]>

Alex Arnaud <[hidden email]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         QA Contact|                            |[hidden email]
                 CC|                            |[hidden email]

--- Comment #16 from Alex Arnaud <[hidden email]> ---
Everything works here and I don't get any QA warnings.

But I have concerns about Bug 20589. That bug disables searching on _all and
could make this one useless. Am I right?

if (yes) {
    Ere, David: Is there something planned to make these 2 bugs work together?
}
