Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A simple unit-test for zim::Search doesn't work #471

Closed
veloman-yunkan opened this issue Jan 18, 2021 · 8 comments · Fixed by #530
Closed

A simple unit-test for zim::Search doesn't work #471

veloman-yunkan opened this issue Jan 18, 2021 · 8 comments · Fixed by #530
Assignees
Milestone

Comments

@veloman-yunkan
Copy link
Collaborator

While working on #449 I encountered problems with search. It turns out that at least part of those problems are not related to my changes since they are present in the master branch too. The following simple unit test fails:

TEST(Search, searchByTitle)
{
  const zim::Archive archive("./data/wikipedia_en_climate_change_nopic_2020-01.zim");
  ASSERT_TRUE(archive.hasTitleIndex());
  const zim::Entry mainEntry = archive.getMainEntry();
  zim::Search search(archive);
  search.set_suggestion_mode(true);
  search.set_query(mainEntry.getTitle()); // mainEntry.getTitle() == "Main Page"
  ASSERT_NE(0, search.get_matches_estimated());
  ASSERT_EQ(mainEntry.getPath(), search.begin().get_url());
}
$ ./search 
Running main() from /usr/src/googletest/googletest/src/gtest_main.cc
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from Search
[ RUN      ] Search.searchByTitle
../../../../../home/leon/freelancing/upwork/kiwix/builddir/SOURCE/libzim/test/search.cpp:37: Failure
Expected: (0) != (search.get_matches_estimated()), actual: 0 vs 0
[  FAILED  ] Search.searchByTitle (15 ms)
[----------] 1 test from Search (15 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (15 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] Search.searchByTitle

Originally, I used the recently created ZIM archive test/data/small.zim in the unit test, but to eliminate the possibility that that ZIM file has problems due to recent changes in libzim I reproduced the same failure on the old ZIM file wikipedia_en_climate_change_nopic_2020-01.zim.

@mgautierfr
Copy link
Collaborator

The main entry is a generated entry which is a redirection to the real main page.
The entry is not handled by the xapian indexer and so is not "searchable".

If you follow the redirection and do a query using the real entry title it should work. (If not we have a bug).

@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Jan 18, 2021

Trying to resolve the redirect (const zim::Entry mainEntry = archive.getMainEntry().getRedirectEntry();) throws an exception when run on wikipedia_en_climate_change_nopic_2020-01.zim:

[ RUN      ] Search.searchByTitle
unknown file: Failure
C++ exception with description "Entry 7323 is not a redirect entry." thrown in the test body.
[  FAILED  ] Search.searchByTitle (8 ms)

When using small.zim it fails in a different way:

[ RUN      ] Search.searchByTitle
../../../../../home/leon/freelancing/upwork/kiwix/builddir/SOURCE/libzim/test/search.cpp:38: Failure
Expected equality of these values:
  mainEntry.getPath()
    Which is: "main.html"
  search.begin().get_url()
    Which is: ""
[  FAILED  ] Search.searchByTitle (1 ms)

Regenerating small.zim using a fresh version of zimwriterfs doesn't help.

@mgautierfr
Copy link
Collaborator

Trying to resolve the redirect (const zim::Entry mainEntry = archive.getMainEntry().getRedirectEntry();) throws an exception when run on wikipedia_en_climate_change_nopic_2020-01.zim:

On old zim file, the mainEntry is directly the "real main entry". On new zim file is a redirection to it.
We need to check if it is a redirection before calling getRedirectEntry (or use getItem(true) which return a item, following redirection if needed, and the path of a entry is the same than the associated item)

When using small.zim it fails in a different way:

This is a bug.

@kelson42
Copy link
Contributor

@veloman-yunkan @mgautierfr what next? This is preety urgent that this is fixed and that the CI works again, so we can merged the PRs whoch are pilling up since december.

@kelson42 kelson42 added this to the libzim 7.0.0 milestone Jan 23, 2021
@veloman-yunkan
Copy link
Collaborator Author

veloman-yunkan commented Jan 24, 2021

It turns out that by default zim::search's results are empty, despite a correct statistics returned by get_matches_estimated(). The reason is that zim::search is created with an empty range:

Search::Search(const std::vector<const File*> zimfiles) :
    internal(new InternalData),
    zimfiles(zimfiles),
    prefixes(""), query(""),
    latitude(0), longitude(0), distance(0),
    range_start(0), range_end(0), // !!!!!!!!!!!!!!!!!!!
    ...
Search::Search(const File* zimfile) :
    internal(new InternalData),
    prefixes(""), query(""),
    latitude(0), longitude(0), distance(0),
    range_start(0), range_end(0), // !!!!!!!!!!!!!!!!!!!
    ...

Then, when the query is executed, 0 is passed for the maxitems parameter of Xapian::Enquire::get_mset(). Its documentation reads:

maxitems the maximum number of items to return. If you want all matches, then you can pass the result of calling get_doccount() on the Database object (though if you are doing this so you can filter results, you are likely to get much better performance by using Xapian's match-time filtering features instead). You can pass 0 for maxitems which will give you an empty MSet with valid statistics (such as get_matches_estimated()) calculated without looking at any postings, which is very quick, but means the estimates may be more approximate and the bounds may be much looser.

Explicitly providing in the unit test a query range via search.set_range(0, 1); before the call to get_matches_estimated() makes it pass. Calling get_matches_estimated() first and then using its result to set the search range doesn't work, since a subsequent call to search.begin() reuses the cached state initialized by the call to search.get_matches_estimated().

@kelson42
Copy link
Contributor

@veloman-yunkan Thank you very much for this analysis.

To me, and if I understand properly, the following two things are "buggy":

  • default range being 0-0
  • search_begin() dealing with "wrong values"

IMO, we need a default range set, but not sure if this should be the estimation (does this max range estimation is expensive to run?) or a fix value. The workflow needs obviously to be fixed to be more robust and stop doing unexpected things "in the back" of the user.

@kelson42
Copy link
Contributor

Might be related to #463 as well

@mgautierfr
Copy link
Collaborator

Indeed, the Search class was mainly use through the File::search method (

libzim/src/file.cpp

Lines 191 to 204 in b3e64fe

std::unique_ptr<Search> File::search(const std::string& query, int start, int end) const {
auto search = std::unique_ptr<Search>(new Search(this));
search->set_query(query);
search->set_range(start, end);
return search;
}
std::unique_ptr<Search> File::suggestions(const std::string& query, int start, int end) const {
auto search = std::unique_ptr<Search>(new Search(this));
search->set_query(query);
search->set_range(start, end);
search->set_suggestion_mode(true);
return search;
}
)
This has been removed with the new api and now we face some inconsistency with the Search class.

We probably have to rethink a bit the api (and the internal structure)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants