Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink search API to fix several issues. #530

Merged
merged 13 commits into from
May 12, 2021
Merged

Conversation

mgautierfr
Copy link
Collaborator

@mgautierfr mgautierfr commented Apr 6, 2021

Fix #463, #516, #471.
Related to https://github.com/kiwix/kiwix-tools/issues/97

First commits roughly change the API to have a correct public API (but somehow wrong internal structures).
Later commits fix the internal structures to have something correct.

The main idea is to have :

  • A Searcher, wrapping a xapian database.
  • A Search, wrapping a particular query on the xapian database.
  • A SearchResult, a set of result (range) corresponding to the Search.

We keep the iterator, but now we iterate over the SearchResult
We also introduce a Query object, describing a query (from user point of view), not associated to any database.

@codecov
Copy link

codecov bot commented Apr 6, 2021

Codecov Report

Merging #530 (b00e635) into master (479d9f3) will increase coverage by 1.24%.
The diff coverage is 79.79%.

❗ Current head b00e635 differs from pull request most recent head ff16617. Consider uploading reports for the commit ff16617 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #530      +/-   ##
==========================================
+ Coverage   80.63%   81.88%   +1.24%     
==========================================
  Files          91       91              
  Lines        3749     3753       +4     
  Branches     1666     1672       +6     
==========================================
+ Hits         3023     3073      +50     
+ Misses        725      679      -46     
  Partials        1        1              
Impacted Files Coverage Δ
include/zim/search_iterator.h 100.00% <ø> (ø)
src/search.cpp 73.68% <76.31%> (+9.49%) ⬆️
src/search_iterator.cpp 84.61% <86.95%> (+13.92%) ⬆️
include/zim/search.h 100.00% <100.00%> (ø)
src/search_internal.h 100.00% <100.00%> (+25.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 479d9f3...ff16617. Read the comment docs.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. It would be nice if the API change is summarized in the PR description.
  2. It is important to state the thread-safety guarantees of the search API.
  3. The changes are not sufficiently covered by the unit tests.

src/search.cpp Outdated Show resolved Hide resolved
include/zim/search.h Outdated Show resolved Hide resolved

Searcher& add_archive(const Archive& archive);

Search search(bool suggestionMode);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if having separate Searcher objects per search mode would be more appropriate (i.e. suggestionMode should be fixed at construction time).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the suggestionMode in the Query in later commits.
But it doesn't change the point you've raised here.

Having a Searcher per mode means we have to create a different searcher for different query.
But the idea is that the Searcher is the wrapper around a database. Although we internally have two different InternalDataBase for now (it is a implementation details), conceptually, there is only one database and we can do two (or three with georange) different kinds of search, on the same searcher.

src/search.cpp Outdated Show resolved Hide resolved
src/search.cpp Outdated Show resolved Hide resolved
src/search.cpp Outdated Show resolved Hide resolved
src/search.cpp Outdated Show resolved Hide resolved
src/search.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@maneeshpm maneeshpm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy with the new structure this PR brings about. It should make future enhancement/maintenance of search more localized and manageable. Though I cannot comment on the technicality(you yourself and veloman are experts here), from a usage point of view I have a couple of suggestions

src/search.cpp Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Collaborator Author

It would be nice if the API change is summarized in the PR description.

Done

It is important to state the thread-safety guarantees of the search API.

We need a document on the API (and a explanation there about the thread-safety).
But globally, there is no thread-safety. Caller must protect against race condition.

The changes are not sufficiently covered by the unit tests.

Agree (we are still in WIP state).

@mgautierfr mgautierfr changed the base branch from master to better_tests_data April 20, 2021 15:27
@mgautierfr mgautierfr force-pushed the better_tests_data branch 3 times, most recently from 5f1f261 to 6bc2dab Compare April 28, 2021 13:02
Base automatically changed from better_tests_data to master April 28, 2021 13:45
Obviously the `initDatabase` should not be const.
But this is mainly a code move.
The const(and api) will be fixed in next commit.
The InternalDataBase class encapsulate all information we can have about
a xapian database.
This commit concentrate the change about regrouping the information in
only one class.

The use of the class itself is done the right way. But it will be changed
in next commits.
The searcher is a wrapper around a xapian database.
It allow to create searches.

This commit concentrate the change around the introduction of the
`Searcher` class. The `Search` is not modified.
@mgautierfr mgautierfr force-pushed the fix_search_api_memlink branch 2 times, most recently from e991f23 to 85e2c31 Compare May 11, 2021 10:13
@mgautierfr
Copy link
Collaborator Author

Rebased on master and resolve conficts.

New commits should fix the comments raised by previous review.

Last commits add unittest on somehow edge case iterator usage. I've fixed the minimum to avoid libzim crashing.
We should homogenize the behavior/API (throw exception all the time, move to camelCase api) around the search_iterator but I prefer to move this in another PR.

We are not testing geoquery search, this is why we don't have a 90% coverage on the patch (but code doesn't really change here)

@mgautierfr mgautierfr marked this pull request as ready for review May 11, 2021 13:13
Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor issues.

src/search_iterator.cpp Outdated Show resolved Hide resolved
test/search_iterator.cpp Outdated Show resolved Hide resolved
src/search_internal.h Outdated Show resolved Hide resolved
src/search_iterator.cpp Outdated Show resolved Hide resolved
Copy link
Collaborator

@maneeshpm maneeshpm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome PR! The tests demonstrating the usage are super useful. I don't have much to add to veloman-yunkan's comments. Just a minor adjustment that can be useful for the future.

src/search.cpp Show resolved Hide resolved
A `SearchResultSet` represent a particular range of search results.
It can be iterated.

This allow the user to reuse a `Search` to get different range results.
By using a `Query` object, we avoid to make our `Search` configurable.
And so, we avoid a user calling a potential `setQuery` on a `Search`
after `getRange` has been called.
As the QueryParser is only dependent of the information in the database
we can directly create it as we create the database.

We don't need a specific method to configure a QueryParser every time.

As m_language and m_stopwords were used only to configure the
queryParser, we don't need to store them in the database.
Parsing the query can be made entirely in the database, so let's move
it there.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants