-
-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add zim files in the "new" format as testing data. #535
Conversation
Codecov Report
@@ Coverage Diff @@
## master #535 +/- ##
==========================================
+ Coverage 79.30% 79.88% +0.58%
==========================================
Files 91 91
Lines 3744 3748 +4
Branches 1701 1661 -40
==========================================
+ Hits 2969 2994 +25
+ Misses 774 753 -21
Partials 1 1
Continue to review full report at Codecov.
|
e070817
to
f5454c6
Compare
@legoktm This PR fails on bionic. It seems that it cannot find the gtest library. But nothing has changed on this side. |
I think that's just meson checking to see whether gtest is installed or not. The real error appears to be:
which corresponds to the line you added:
bionic uses the meson 0.45.1, which might not have that feature yet? |
Yes, it is probably that. I've missed the line. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the main objective of this PR. My impression from #531 was that it was preparing ground for validating #518. Therefore my recommendation was not to aim at updating all tests at once, but create a few unit tests focusing on #518 using this new approach for test data management. As a result of my questions asked in #531 we now have a bigger PR with a larger and not so well defined scope - it both updates a lot of unit-tests and fixes some issues with Archive::iterByTitle()
.
test/tools.cpp
Outdated
std::string dataDir; | ||
setDataDir(dataDir); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is confusing. Why not const std::string dataDir = getDataDir();
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As said in the comment inside setDataDir
, the gtest FAIL
method can be used only in a void function. So we have to use this surprising (I agree) signature.
The main objective of this PR is to run all our unittests with zim files using the new "format" (no namespace, some fix on the metadata of the xapian db, ...). |
IMHO, before initiating any real changes in this direction we must have some vision. Of course exploratory/experimental changes are a way of learning important things that may influence the final vision/plan, but we should at least treat them as such in the beginning. So, to the best of my understanding you are trying to establish an approach for dealing with the need to test backward compatibility of libzim on multiple flavours of ZIM archive content. We need to realize that if ZIM archives have evolved before they will likely evolve in the future too (though we may be forced to limit the rate at which new flavours are introduced to keep the full current diversity of ZIM archive flavours manageable). The proposed approach for running the same test on multiple variants of the "same" ZIM archive must be as future-proof as possible. To this end, we better first understand what kind of diversity we will be dealing with. Generally speaking, a ZIM archive flavour is a combinations of multiple features such as:
Some tests will be concerned with only a single dimension in the total space of all possible ZIM flavours. Question: do we want to handle these and the next paragraph as special cases of the same general approach? On the other hand, not all combinations are valid (old namespace scheme would never occur with a front article index). The latter observation suggests that a linear/versionable sequence of typical ZIM content flavours (i.e. a flavour of a ZIM archive that could have been created by libzim between such and such dates) also makes some sense (similar to ZIM format versions). Some high level tests will probably need to run on all those versions of ZIM archives (or, rather, starting from a particular version where the requisite feature first appears). BTW, it is possible to take an approach where the test code is written to work with a single version of the given ZIM file (avoiding the boilerplate of iterating over ZIM files in each test case) - the responsibility of testing the given functionality on multiple flavours can be shifted onto a driver script that runs the same test in different environments effectively feeding different data to it. The tests can have certain metadata attached to them that will instruct the driver script which inputs to exercise the unit-test on. |
I totally agree with you. Except that it is not in the scope for now :) But for now this is not the scope. That is why I don't want to add more binary file to this repository : I'm pretty sure this is not the good solution and the test files will be dropped. And I don't want to increase the git history with file which will be removed in the next future but user will have to download all the time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main goal for now is to finish libzim7 and do a release.
If we are going take shortcuts, then I will stop being too picky. I am leaving my feedback as comments. Consider it as an approval based on your judgement.
The "Fix 'iterByTitle'" doesn't fit well into this PR.
Naming the old variant of ZIM files "normal" is not a good idea. Soon nons will be the normal version before being overridden by something else. Use a name with a more stable meaning (e.g. withns or prelibzim7).
test/archive.cpp
Outdated
for(auto& testfile: getDataFilePath("invalid.bad_mimetype_in_dirent.zim")) { | ||
std::string expected; | ||
if (testfile.category == "normal") { | ||
expected = "Entry M/Language has invalid MIME-type value 1234.\n"; | ||
} else { | ||
expected = "Entry M/Scraper has invalid MIME-type value 1234.\n"; | ||
} | ||
EXPECT_BROKEN_ZIMFILE(testfile.path, expected) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to generate invalid.bad_mimetype_in_dirent.zim
in such a way that the error message is the same across all of its variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Partly agree.
It would be better on the test side yes.
But then we would have a evolution in the script generating the test data without a easy way to know which version to use.
We should handle this issue correctly when we will design a correct testing suite.
auto count = 0; | ||
for(auto& entry: range0) { | ||
count++; | ||
ASSERT_EQ(entry.getPath().find("Першая_старонка.html"), 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't ASSERT_EQ(entry.getPath(), "Першая_старонка.html")
a better check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
archive.findByPath
returns a range of entries starting by the given path.
So the "real" semantic check is that entry.getPath
start by the given path, even if there is only one result.
I've used the same check that ByPath
test.
Why ? @veloman-yunkan Do you still prefer to add the test data in this repository or it is ok for you that add the new test file in repository |
968ce9b
to
a6fee67
Compare
Since we will have to share test ZIM archives between multiple kiwix-projects we will have to use a separate repository for them eventually, so it doesn't make sense to defer that. But there is one thing that must be done before that. If I have for whatever reason to travel back in time and work with an old version of source code, I want to know which version of test data that project used at that time (in other words |
91eab3b
to
a6fee67
Compare
187ea4e
to
95c4aac
Compare
We may have several "same" (variant) zim files to test. Let's `getDataFilePath` return a list of path of those "same" files and test each paths with the same test.
…ata. Instead of copying those files in the build dir, directly use them in the source dir.
- Now have subdirectory for each "variant". - Two variants : withns and nons.
If we have a specific title index, we must iterate on it. We must not iterate on a (wrong) subset of the entries. Related to #514
This allow tests to have information about the "kind" of the test file. This is not used for now but will be in next commit. It is also possible to force a particular category for the test. (Not used neither here).
New zim files have totally different indexes.
The indexes are changed between test zim files. Indexes are totally internal to zim files and don't have to be stable.
The "Main Page" entry is a redirection pointing to the real main page. But now, the redirection itself is not searchable, so we cannot search for "Main Page". We have to use the real title of the target item. On wikibooks_be_all_nopic_2017-02, the new zim file now have a titleIndex so we run the "search test" in `checkEquivalence`. But wikibooks_be_all_nopic_2017-02 contains two items with the title of the mainPage : - The (real) entry itself. - A redirection pointing to the real entry. So, we need to also resolve the redirection of the search result to check that path correspond.
The paths are different by definition. Let's create another test.
Some error message are slightly different between normal and nons test zim files.
5f1f261
to
6bc2dab
Compare
This PR somehow invalidate #531
(Moving the zim file in a different repository is still a open idea. But it can be done after this PR. Although it would be pretty useless. Adding the binary files to git will grow the total git repository size even if we remove them just after).
First commits prepare a bit the tests tools to be prepared to test several "same" test files.
After we add the new test files (which can be replaced by the download presented in #531)
Then we update the tests for the new test files.
It also "cherry-pick" #518 which now because obsolete.