Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce Search/Searcher Caching to Internal Server #620

Merged
merged 3 commits into from
Mar 8, 2022

Conversation

maneeshpm
Copy link
Contributor

@maneeshpm maneeshpm commented Oct 12, 2021

Fixes #509
This PR consists of three commits which are as follows:

  • Add a general-purpose LRU cache template

This general-purpose LRU cache template can be used to implement caching for various classes with various cache size limits.
NOTE: One must think about thread safety and race condition while using the template.

  • Implement caching on Searcher and Search

We use the new cache template to implement two kind of cache.
1: The Searcher cache is more general in terms of its usage. A Searcher can be used for multiple searches without much change to itself. We try to retrieve the Searcher and perform searches using it whenever possible, and if not we put a Searcher into the cache. Users can specify a custom cache length by manipulating the environment variable SEARCHER_CACHE_SIZE. Its default value is 10% of all the books available.
2: The Search cache is much more restricted in terms of usage. Its main purpose is to avoid re-searching on the Searcher during page changes to generate SearchResultSet of various ranges. Users can specify a custom cache length using the environment variable SEARCH_CACHE_SIZE with a default value of 2.

  • Implement caching on SuggestionSearcher

We create a cache for SuggestionSearcher very similar to that of FT searcher. User can specify a custom cache size using the environment variable SUGGESTION_SEARCHER_CACHE_SIZE. It has a default value of 10% of the number of books in the library.

@codecov
Copy link

codecov bot commented Oct 12, 2021

Codecov Report

Merging #620 (e8cdcf0) into master (833bbc8) will increase coverage by 8.00%.
The diff coverage is 77.27%.

❗ Current head e8cdcf0 differs from pull request most recent head 6523d9f. Consider uploading reports for the commit 6523d9f to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #620      +/-   ##
==========================================
+ Coverage   58.03%   66.04%   +8.00%     
==========================================
  Files          54       55       +1     
  Lines        3584     4102     +518     
  Branches     2019     2088      +69     
==========================================
+ Hits         2080     2709     +629     
+ Misses       1503     1392     -111     
  Partials        1        1              
Impacted Files Coverage Δ
src/server/internalServer.h 33.33% <ø> (ø)
src/tools/cache.hpp 71.69% <71.69%> (ø)
src/server/internalServer.cpp 82.54% <85.71%> (+1.48%) ⬆️
src/search_renderer.cpp 68.75% <0.00%> (-20.59%) ⬇️
src/library.cpp 78.26% <0.00%> (-4.53%) ⬇️
src/server/response.cpp 85.77% <0.00%> (-0.52%) ⬇️
src/aria2.cpp 0.00% <0.00%> (ø)
include/book.h 96.15% <0.00%> (ø)
src/version.cpp 0.00% <0.00%> (ø)
include/server.h 100.00% <0.00%> (ø)
... and 28 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 833bbc8...6523d9f. Read the comment docs.

Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few thing to change but the global structure is good.

src/tools/cache.cpp Outdated Show resolved Hide resolved
src/tools/cache.cpp Outdated Show resolved Hide resolved
src/tools/cache.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Show resolved Hide resolved
src/meson.build Outdated Show resolved Hide resolved
maneeshpm added a commit to openzim/libzim that referenced this pull request Nov 1, 2021
This change though seemingly insignificant till now, proved to be an
important aspect in kiwix/libkiwix#620. When the searcher is retrieved
from cache, it should start up in OP_AND instead of OP_OR.
@maneeshpm maneeshpm marked this pull request as ready for review November 1, 2021 09:17
Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are still few changes to do but we are mostly good.

src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comments on the code.

But I've one comment coming from you, in the first commit message :

NOTE: One must think about thread safety and race condition while using
the template.

src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
src/server/internalServer.cpp Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Member

I don't understand your comment about my comment on thread safety?

In the commit where you introduce the cache system, you write :

NOTE: One must think about thread safety and race condition while usingthe template.

I agree with you.
But then, in other commits, you don't protect the calls against race condition.
The server handles the requests in different threads, and so you must "think about thread safety and race condition" (and protect/modify the code accordingly)

@mgautierfr mgautierfr added this to the 10.1.0 milestone Dec 1, 2021
@kelson42
Copy link
Collaborator

kelson42 commented Dec 7, 2021

@maneeshpm Any news on this PR? It is one of the last one before releasing 10.0.0. This PR would also benefit of a proper automated test to secure the cache does work as intended and does not introduce regressions.

@maneeshpm
Copy link
Contributor Author

@mgautierfr aah ohk, so we are protecting suggestions module via mutex lock in libzim. You mean we need to do a similar treatment for FT search as well?

@kelson42 Sure, will do something about it.

@mgautierfr
Copy link
Member

You mean we need to do a similar treatment for FT search as well?

I was thinking about protecting the cache system (you explicitly say that in your commit message) but yes, we also need to protect the FT search.

@mgautierfr
Copy link
Member

I wonder if we should move the searcher/suggestionSearcher cache in the library itself (as for the readers/archives)
It would simplify a bit the code with a Library method to get the searcher from a bookId (or even better, a book name if we also move the namemapper into the library (not sure we need to do this in this pr)).


The mutex protection you've added is not enough. It is technically protect the internal structures but a race condition is still possible :

  • Two threads can try to get a searcher for book foo.
  • The two threads cannot found it.
  • They both create a searcher.
  • They both add the searcher to the cache.
  • Only one is added to the cache, but we use two different searcher. And when we create a search and put it in the cache, they will use two different searchers.

What we also need is to protect (block) the cache while we are creating the searcher to avoid the creation of two searcher.
It is not easy to do (if we want to do it efficiently). But we already have a implementation in libzim (https://github.com/openzim/libzim/blob/master/src/concurrent_cache.h) we may reuse it.

@maneeshpm
Copy link
Contributor Author

I wonder if we should move the searcher/suggestionSearcher cache in the library itself

@mgautierfr I agree, we are anyway pulling in the archive from the library, so it is more natural to get the searchers(and search via cache or otherwise) from the library itself rather than keeping it associated with the internal server. I propose we do it in a separate ticket after this.

Thanks for the suggestion, I'll get to see the proper implementation of an LRU cache rather than a simple one 😅 : I guess concurrent_cache.h in libzim is well written and tested!

@maneeshpm
Copy link
Contributor Author

@kelson42 I have traced the error to the newly added function getCacheLength() that is supposed to read an environment variable and return its value(if found) or return a default value provided by the user. Returning a hardcoded number in the function such as 1 causes the error to go away. I am not really sure why that function fails on mac.

@kelson42
Copy link
Collaborator

kelson42 commented Jan 8, 2022

I wonder why in a first place we deal with an ENV variable here? IMO, either there is a given one by the user or we use a default value. I don't like the idea that the ENV plays a role here.

@kelson42
Copy link
Collaborator

kelson42 commented Jan 8, 2022

@maneeshpm But we have this kind of behaviour already in libzim and it works fine with macOS, maybe you could have a look to how this is done there?

@maneeshpm maneeshpm force-pushed the search_caching branch 2 times, most recently from e8cdcf0 to c5ed9c5 Compare January 9, 2022 12:42
@maneeshpm
Copy link
Contributor Author

maneeshpm commented Jan 9, 2022

Apparently, this problem was caused by a simple nullptr returned by getenv(). Handling that case separately rather than relying on the try catch block fixes the problem.

I am not sure of a technical explanation for why this is not a problem on linux but on mac, maybe @mgautierfr can help with one 😅

@kelson42
Copy link
Collaborator

kelson42 commented Jan 9, 2022

@mgautierfr I see this ticket is ready for final review pass!

@kelson42 kelson42 modified the milestones: 10.1.0, 10.0.1 Jan 9, 2022
@kelson42 kelson42 modified the milestones: 10.0.1, 10.1.0 Feb 3, 2022
Copy link
Member

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for this late review @maneeshpm

It seems so have copied the lru_cache and ConcurrentCache from libzim.
There is no problem with this, but please change the commit message accordingly.
(And put a commit id of the version of the code you copy from. It will be simpler to know what will be the change in the future)

Else it seems we are good.

I am not sure of a technical explanation for why this is not a problem on linux but on mac, maybe @mgautierfr can help with one sweat_smile

The doc of std::string constructor (https://www.cplusplus.com/reference/string/string/string/) says that it is a undefined behavior if the pointer is null.
Maybe gcc (or its std library) create a empty string if we pass a null pointer and on mac a exception is raised.

@kelson42
Copy link
Collaborator

@maneeshpm Any feedback? We are really not far to merge this PR!

@kelson42
Copy link
Collaborator

@mgautierfr Considering that there is really not much to do and that @maneeshpm seems not available for the moment. Please feel free to just fix and merge.

The cache is copied from libzim project : https://github.com/openzim/libzim
The exact file as been copied from commit 27f5e70
We use the new cache template to implement two kind of cache.
1: The Searcher cache is more general in terms of its usage. A Searcher
   can be used for multiple searches without much change to itself. We
   try to retrieve the searcher and perform searches using it whenever
   possible, and if not we put a searcher into the cache. User can
   specify a custom cache length by manipulating the environment
   variable SEARCHER_CACHE_SIZE. It's default value is 10% of all the
   books available.
2: The search cache is much more restricted in terms of usage. It's main
   purpose is to avoid re-searching on the searcher during page changes
   to generate SearchResultSet of various ranges. User can specify a
   custom cache length using the environment variable SEARCH_CACHE_SIZE
   with a default value of 2;
We create a cache for SuggestionSearcher very similar to that of FT
searcher. User can specify a custom cache size using the environment
variable SUGGESTION_SEARCHER_CACHE_SIZE. It has a default value of 10%
of the number of books in the library.
@mgautierfr
Copy link
Member

I've rebase on master and redo a bit the first commit which is a copy of file from libzim.
I haven't changed the other commits from @maneeshpm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

We should avoid to search again from beggining when changing search page.
3 participants