Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter ZIMs by humanid & aliases #808

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Filter ZIMs by humanid & aliases #808

wants to merge 6 commits into from

Conversation

juuz0
Copy link
Collaborator

@juuz0 juuz0 commented Aug 15, 2022

Fixes #807

@juuz0 juuz0 requested a review from kelson42 August 15, 2022 16:46
@juuz0
Copy link
Collaborator Author

juuz0 commented Aug 15, 2022

Example test link:
http://192.168.18.8:8080/?book=termux_en_all_maxi_2022-05&book=installgentoo_en_all_nopic_2019-09
image

@kelson42
Copy link
Collaborator

kelson42 commented Aug 15, 2022

Pretty sure that installgentoo_en_all_nopic_2019-09 is not the zim name metadata. It's the human friendly name.

@juuz0
Copy link
Collaborator Author

juuz0 commented Aug 15, 2022

It's the filename (whatever it is on system, you can try changing a file name and see).
ZIM's name field is not unique as per my observation (for example archwiki_en_all__nopic and archwiki_en_all__maxi have the same name: archwiki_en_all)

@juuz0 juuz0 changed the title Filter ZIMs by filename in catalog/search Filter ZIMs by alias name Aug 15, 2022
@juuz0
Copy link
Collaborator Author

juuz0 commented Aug 15, 2022

@kelson42 Updated to use alias name.
image

@codecov
Copy link

codecov bot commented Aug 15, 2022

Codecov Report

Base: 70.51% // Head: 71.13% // Increases project coverage by +0.62% 🎉

Coverage data is based on head (151106d) compared to base (86eacea).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head 151106d differs from pull request most recent head 2cd0579. Consider uploading reports for the commit 2cd0579 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #808      +/-   ##
==========================================
+ Coverage   70.51%   71.13%   +0.62%     
==========================================
  Files          53       53              
  Lines        3653     3756     +103     
  Branches     2029     2089      +60     
==========================================
+ Hits         2576     2672      +96     
- Misses       1075     1082       +7     
  Partials        2        2              
Impacted Files Coverage Δ
include/name_mapper.h 71.42% <ø> (ø)
include/library.h 100.00% <100.00%> (ø)
src/library.cpp 83.97% <100.00%> (+0.56%) ⬆️
src/name_mapper.cpp 100.00% <100.00%> (ø)
src/server/internalServer.cpp 88.09% <100.00%> (+0.14%) ⬆️
src/server/response.cpp 92.06% <0.00%> (-1.60%) ⬇️
src/server/internalServer_catalog_v2.cpp 92.85% <0.00%> (-0.17%) ⬇️
include/server.h 100.00% <0.00%> (ø)
src/server/i18n.h 100.00% <0.00%> (ø)
src/server/response.h 100.00% <0.00%> (ø)
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@kelson42
Copy link
Collaborator

@juuz0 Thx, it's great if can now work with both filename and aliases. I will test it. Reading the source code, I have too remarks:

  • supporting of aliases shoud probably works only if --nodatealiases is used (just to be coherent)
  • I would reuse the same primitive like there to compute the alias name

@kelson42 kelson42 changed the title Filter ZIMs by alias name Filter ZIMs by humanid & aliases Aug 16, 2022
@kelson42
Copy link
Collaborator

@juuz0 Any feedback?

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No unit tests have been added!

src/library.cpp Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
src/library.cpp Outdated Show resolved Hide resolved
@@ -119,6 +119,9 @@ Filter get_search_filter(const RequestContext& request, const std::string& prefi
try {
filter.rejectTags(kiwix::split(request.get_argument(prefix+"notag"), ";"));
} catch (...) {}
try {
filter.aliasName(request.get_argument(prefix + "book"));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My guess is that selection of books by alias/filename should not be combined with other filtering criteria. If so, then it rather must be done in InternalServer::selectBooks() (similar to selection by id or name).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't understand this fully. We need a similar way to filter using query string (just like lang/category/tag)

Copy link
Collaborator

@veloman-yunkan veloman-yunkan Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was not the point of my comment. Do the following queries make sense?

  • catalog/v2/entries?book=<bookid>&lang=eng
  • catalog/v2/entries?book=<bookid1>&book=<bookid2>&category=wikipedia
  • catalog/v2/entries?book=<bookid>&tag=nopic

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan Not sure to understand your remark either (is that a principle remark about filtering by humanid or a just an implementation remark). Your example don't include any "human_id" argument? A request of the type catalog/v2/entries?book=<bookid>&lang=eng can not fullfill the feature request because it is not stable over time. The goal here is to have a way to selection an arbitrary list of ZIM file over time. AFAIK, using the human ids and there corresponding aliases is the only way for the moment... and so far it has works fine for the Web URLS (for example library.kiwix.org/wikipedia_en_all_maxi/).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kelson42 In my examples <bookid> is supposed to be the "human_id". The examples illustrate the following cases:

  • selection by "human_id" is combined with filtering by language
  • selection by "human_id" is combined with filtering by category
  • selection by "human_id" is combined with filtering by tag

In my understanding, selection by human_id should be mutually exclusive with other ways of content filtering.

Copy link
Collaborator

@kelson42 kelson42 Sep 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@veloman-yunkan I understand you now. I'm not in favour of hardwiring exclusivity but yes, your examples don't really make sense. For example, if the humanid is for a book in Armenian, then if lang=eng then no result, but if lang=arm then if will give one result. Etc... To me your remark left is mostly about where to implement things, and on that I have no opinion/fully trust you.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kelson42 My vision is that human_id must not be combined with other selection criteria. My original proposal applies only to such a use model.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kelson42 In other words selection by humanid should work similar to selection by name or uuid (in which case other filtering criteria are ignored).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing fondamental against this, even if I would prefer to avoid a special treatment. @juuz0 good to you to fix the points noticed by @veloman-yunkan ?

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please get rid of the fixup commits for the next review.

Comment on lines +54 to +57
std::string HumanReadableNameMapper::removeDateFromBookId(const std::string& bookId) {
return replaceRegex(bookId, "", "_[[:digit:]]{4}-[[:digit:]]{2}$");
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please extract this function in a separate preparatory commit.

@@ -119,6 +119,9 @@ Filter get_search_filter(const RequestContext& request, const std::string& prefi
try {
filter.rejectTags(kiwix::split(request.get_argument(prefix+"notag"), ";"));
} catch (...) {}
try {
filter.aliasName(request.get_argument(prefix + "book"));
Copy link
Collaborator

@veloman-yunkan veloman-yunkan Sep 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was not the point of my comment. Do the following queries make sense?

  • catalog/v2/entries?book=<bookid>&lang=eng
  • catalog/v2/entries?book=<bookid1>&book=<bookid2>&category=wikipedia
  • catalog/v2/entries?book=<bookid>&tag=nopic

Comment on lines +503 to +520
TEST_F(LibraryTest, filterByAliasNames)
{
// filtering for one book
EXPECT_FILTER_RESULTS(kiwix::Filter().aliasNames({"zimfile"}),
"Ray Charles"
);

// filerting for more than one book
EXPECT_FILTER_RESULTS(kiwix::Filter().aliasNames({"zimfile", "example"}),
"An example ZIM archive",
"Ray Charles"
);

// filtering by alias name requires full text match
EXPECT_FILTER_RESULTS(kiwix::Filter().aliasNames({"wrong_name"}),
/* no results */
);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was rather expecting new tests in test/library_server.cpp, that in particular would demonstrate how the new filter is expected to interact with other filters?

Extracts the duplicate code from publisherQuery() and nameQuery() into a new function parseQuery().
Adds mechanism to get a ZIM using its alias name.
To make a search, one needs to visit:
`TLD/?book=aliasNameHere`
Adds support for putting multiple `book` query parameter.
Adds test to check if filtering by alias name works.
@mgautierfr
Copy link
Member

I want to mention this new issue #828
Please remember the server can be run with a namemapper using the book uuid as a name (https://github.com/kiwix/libkiwix/blob/master/include/name_mapper.h#L41-L45) and it is the default name mapper for kiwix::Server.

Be sure that you are not assuming too much that we will use a HumanReadableNameMapper

@stale
Copy link

stale bot commented Nov 2, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2022
@kelson42
Copy link
Collaborator

kelson42 commented Nov 8, 2022

@juuz0 Any feedback?

@stale stale bot removed the stale label Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filter ZIMs by humanid & aliases
4 participants