Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fully xapian powered catalog search #488

Merged
merged 32 commits into from
Apr 27, 2021

Conversation

veloman-yunkan
Copy link
Collaborator

@veloman-yunkan veloman-yunkan commented Apr 11, 2021

Fixes #484 with the exception of the local, remote, valid and maxSize filtering criteria. Those can be processed via Xapian too, but is it worth the effort (though we will gain some uniformity in how a query is fulfilled - no postprocessing through Filter::accept() will be needed)?

Also

  • adds diacritics insensitivity to catalog filtering by title/description
  • the unit-test of Library::filter() is greatly enhanced

@codecov
Copy link

codecov bot commented Apr 11, 2021

Codecov Report

Merging #488 (63e9a09) into master (d134ad4) will increase coverage by 1.21%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #488      +/-   ##
==========================================
+ Coverage   63.18%   64.39%   +1.21%     
==========================================
  Files          50       50              
  Lines        3504     3556      +52     
  Branches     1773     1816      +43     
==========================================
+ Hits         2214     2290      +76     
+ Misses       1288     1264      -24     
  Partials        2        2              
Impacted Files Coverage Δ
include/library.h 91.66% <100.00%> (+11.66%) ⬆️
src/library.cpp 78.34% <100.00%> (+10.25%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d134ad4...63e9a09. Read the comment docs.

@veloman-yunkan veloman-yunkan force-pushed the fully_xapian_powered_catalog_search branch 4 times, most recently from 9f307b3 to d63ca4a Compare April 13, 2021 19:54
@veloman-yunkan veloman-yunkan changed the title [WIP] Fully xapian powered catalog search Fully xapian powered catalog search Apr 13, 2021
@veloman-yunkan veloman-yunkan marked this pull request as ready for review April 13, 2021 20:12
@veloman-yunkan veloman-yunkan force-pushed the fully_xapian_powered_catalog_search branch from d63ca4a to 1d8e710 Compare April 16, 2021 08:48
@veloman-yunkan
Copy link
Collaborator Author

Incorporated @maneeshpm's fix of openzim/libzim#534 into the commits titled "Handling of non-words in publisher query" and "Catalog filtering by creator works via Xapian" (respectively commits 7bfd050 and 109e54d in this revision of the PR).

@veloman-yunkan veloman-yunkan force-pushed the fully_xapian_powered_catalog_search branch from 1d8e710 to 459da6a Compare April 16, 2021 18:24
@kelson42
Copy link
Collaborator

@mgautierfr It is important this PR is reviewed so @veloman-yunkan can move on.

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a lot to say, code is good.
Just two questions.

test/library.cpp Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Apr 21, 2021

Not a lot to say, code is good.
Just two questions.

@mgautierfr And what about the question raised in the description of the PR?

@mgautierfr
Copy link
Member

mgautierfr commented Apr 27, 2021

@veloman-yunkan Please rebase on top of master (and merge yourself if I took too much time to come back on this PR)

This should have been done back in PR #460
Moved the `filter.hasQuery()` check inside `buildXapianQuery()`.
`Library::filterViaBookDB()` only cares if the query that is going to be
run on the book DB would match all documents. The rest of changes
related to enhancing the usage of Xapian for the catalog search will
happen inside `buildXapianQuery()` and `updateBookDB()`.
This change fixes the failure of the LibraryTest.filterByPublisher
unit-test broken by the previous commit.

The previous approach used in `publisherQuery()` for building a phrase
query enforcing the specified prefix for all terms fails if

1. the input phrase contains a non-word term that Xapian's query parser
   doesn't like (e.g. a standalone ampersand character, 1/2, a#1, etc);
2. the input phrase contains at least three terms that Xapian's query
   parser has no issue with.

Using the `quest` tool (coming with xapian-tools under Ubuntu) the
issue can be demonstrated as follows:

```
$ quest -o phrase -d some_xapian_db "Energy & security"
Parsed Query: Query((energy@1 PHRASE 11 Zsecur@2))
Exactly 0 matches
MSet:

$ quest -o phrase -d some_xapian_db "Energy & security act"
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries

$ quest -o phrase -d some_xapian_db 'Energy 1/2 security act'
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries

$ quest -o phrase -d some_xapian_db "Energy a#1 security act"
UnimplementedError: OP_NEAR and OP_PHRASE only currently support leaf subqueries
```

The problem comes from parsing the query with the default operation set
to `OP_PHRASE` (exemplified by the `-o phrase` option in above
invocations of `quest`). A workaround is to parse the phrase with a
default operation of `OP_OR` and then combine all the terms with
`OP_PHRASE`.

Besides stemming should be disabled in order to target an exact phrase
match (save for the non-word terms, if any, that are ignored by the
query parser).
Catalog filtering by titles/description was sensitive to diacritics
present in the query string. Fixed that.

Also enhanced the unit test to validate the insensitivity to diacritics
present in either the title/description or the query string.
Catalog filtering should now be case/diacritics insensitive for all
fields. However it is not validated for language, name and category
fields, and is validated for tags, creator & publisher only for text
supplied in the filter (but not for values read from the book).
The library set up by LibraryTest now contains two valid books
initialized via XML. Therefore XmlLibraryTest is not needed as a
separate test suite.
@veloman-yunkan veloman-yunkan force-pushed the fully_xapian_powered_catalog_search branch from 845dfda to 63e9a09 Compare April 27, 2021 13:00
@veloman-yunkan
Copy link
Collaborator Author

@veloman-yunkan Please rebase on top of master (and merge yourself if I took too much time to come back on this PR)

@mgautierfr I don't have enough privileges in this repository

@mgautierfr mgautierfr merged commit 7336dca into master Apr 27, 2021
@mgautierfr mgautierfr deleted the fully_xapian_powered_catalog_search branch April 27, 2021 13:10
@mgautierfr
Copy link
Member

I don't have enough privileges in this repository

Wow, I will see with to change that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[library] Use Xapian for library search on all fields
3 participants