Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add zim files in the "new" format as testing data. #535

Merged
merged 14 commits into from
Apr 28, 2021

Conversation

mgautierfr
Copy link
Collaborator

@mgautierfr mgautierfr commented Apr 15, 2021

This PR somehow invalidate #531
(Moving the zim file in a different repository is still a open idea. But it can be done after this PR. Although it would be pretty useless. Adding the binary files to git will grow the total git repository size even if we remove them just after).

First commits prepare a bit the tests tools to be prepared to test several "same" test files.
After we add the new test files (which can be replaced by the download presented in #531)
Then we update the tests for the new test files.

It also "cherry-pick" #518 which now because obsolete.

@codecov
Copy link

codecov bot commented Apr 15, 2021

Codecov Report

Merging #535 (91eab3b) into master (75fcc61) will increase coverage by 0.58%.
The diff coverage is 91.66%.

❗ Current head 91eab3b differs from pull request most recent head 6bc2dab. Consider uploading reports for the commit 6bc2dab to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #535      +/-   ##
==========================================
+ Coverage   79.30%   79.88%   +0.58%     
==========================================
  Files          91       91              
  Lines        3744     3748       +4     
  Branches     1701     1661      -40     
==========================================
+ Hits         2969     2994      +25     
+ Misses        774      753      -21     
  Partials        1        1              
Impacted Files Coverage Δ
src/writer/handler.h 100.00% <ø> (ø)
src/writer/counterHandler.cpp 87.50% <87.50%> (ø)
src/archive.cpp 62.98% <100.00%> (+4.32%) ⬆️
src/fileimpl.cpp 85.62% <100.00%> (+1.57%) ⬆️
src/fileimpl.h 90.90% <100.00%> (+0.90%) ⬆️
src/writer/counterHandler.h 100.00% <100.00%> (ø)
src/writer/creator.cpp 83.49% <100.00%> (ø)
src/writer/titleListingHandler.h 100.00% <100.00%> (ø)
src/writer/xapianHandler.h 100.00% <100.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 75fcc61...6bc2dab. Read the comment docs.

@mgautierfr mgautierfr force-pushed the better_tests_data branch 4 times, most recently from e070817 to f5454c6 Compare April 15, 2021 14:45
@mgautierfr
Copy link
Collaborator Author

@legoktm This PR fails on bionic. It seems that it cannot find the gtest library. But nothing has changed on this side.
Do you know what is the problem ?

@legoktm
Copy link
Member

legoktm commented Apr 15, 2021

@legoktm This PR fails on bionic. It seems that it cannot find the gtest library. But nothing has changed on this side.
Do you know what is the problem ?

I think that's just meson checking to see whether gtest is installed or not. The real error appears to be:

test/meson.build:29:0: ERROR: Division works only with integers.

which corresponds to the line you added:

datadir = meson.current_source_dir() / 'data'

bionic uses the meson 0.45.1, which might not have that feature yet?

@mgautierfr
Copy link
Collaborator Author

bionic uses the meson 0.45.1, which might not have that feature yet?

Yes, it is probably that. I've missed the line. Thanks.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the main objective of this PR. My impression from #531 was that it was preparing ground for validating #518. Therefore my recommendation was not to aim at updating all tests at once, but create a few unit tests focusing on #518 using this new approach for test data management. As a result of my questions asked in #531 we now have a bigger PR with a larger and not so well defined scope - it both updates a lot of unit-tests and fixes some issues with Archive::iterByTitle().

test/tools.cpp Outdated
Comment on lines 114 to 115
std::string dataDir;
setDataDir(dataDir);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing. Why not const std::string dataDir = getDataDir();?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As said in the comment inside setDataDir, the gtest FAIL method can be used only in a void function. So we have to use this surprising (I agree) signature.

@mgautierfr
Copy link
Collaborator Author

I don't fully understand the main objective of this PR.

The main objective of this PR is to run all our unittests with zim files using the new "format" (no namespace, some fix on the metadata of the xapian db, ...).
We need to update our test data and we must keep the existing test data also. #531 is somehow a preparatory work of this PR (as I prefer to not add new test file to the git repository).

@veloman-yunkan
Copy link
Collaborator

I don't fully understand the main objective of this PR.

The main objective of this PR is to run all our unittests with zim files using the new "format" (no namespace, some fix on the metadata of the xapian db, ...).
We need to update our test data and we must keep the existing test data also. #531 is somehow a preparatory work of this PR (as I prefer to not add new test file to the git repository).

IMHO, before initiating any real changes in this direction we must have some vision. Of course exploratory/experimental changes are a way of learning important things that may influence the final vision/plan, but we should at least treat them as such in the beginning. So, to the best of my understanding you are trying to establish an approach for dealing with the need to test backward compatibility of libzim on multiple flavours of ZIM archive content. We need to realize that if ZIM archives have evolved before they will likely evolve in the future too (though we may be forced to limit the rate at which new flavours are introduced to keep the full current diversity of ZIM archive flavours manageable). The proposed approach for running the same test on multiple variants of the "same" ZIM archive must be as future-proof as possible. To this end, we better first understand what kind of diversity we will be dealing with.

Generally speaking, a ZIM archive flavour is a combinations of multiple features such as:

  1. Compression method
  2. Grouping of content (namespaces)
  3. Presence of the front article index
  4. Split or single-piece
  5. etc

Some tests will be concerned with only a single dimension in the total space of all possible ZIM flavours. Question: do we want to handle these and the next paragraph as special cases of the same general approach?

On the other hand, not all combinations are valid (old namespace scheme would never occur with a front article index). The latter observation suggests that a linear/versionable sequence of typical ZIM content flavours (i.e. a flavour of a ZIM archive that could have been created by libzim between such and such dates) also makes some sense (similar to ZIM format versions). Some high level tests will probably need to run on all those versions of ZIM archives (or, rather, starting from a particular version where the requisite feature first appears).

BTW, it is possible to take an approach where the test code is written to work with a single version of the given ZIM file (avoiding the boilerplate of iterating over ZIM files in each test case) - the responsibility of testing the given functionality on multiple flavours can be shifted onto a driver script that runs the same test in different environments effectively feeding different data to it. The tests can have certain metadata attached to them that will instruct the driver script which inputs to exercise the unit-test on.

@mgautierfr
Copy link
Collaborator Author

I totally agree with you. Except that it is not in the scope for now :)
Your comment make totally sense, and it could (should) be made in #469.
I realize we have this discussion with @kelson42 in our weekly meeting but never write this somewhere, so you don't fully understand where we are.
This is the next big project (and why we have created https://github.com/openzim/zim-testing-suite).
We will need to answer all you questions before starting the (real) implementation.

But for now this is not the scope.
The main goal for now is to finish libzim7 and do a release. We already have several mouths late on this.
We have to finish the API (#530, #514, #532), finish the format (#527) and test the new version code with old and new zim formats.
This PR is just testing the code with old and new format. This is a quick and dirty path. I totally agree this is not the good approach but we not really have the time to design a proper testing system and add another weeks of delay.

That is why I don't want to add more binary file to this repository : I'm pretty sure this is not the good solution and the test files will be dropped. And I don't want to increase the git history with file which will be removed in the next future but user will have to download all the time.
I've toke the path with less definitive impact on the future.
Just put the test data somewhere else, finish the release of libzim7 and then think about the testing suite to have a proper system.

Copy link
Collaborator

@veloman-yunkan veloman-yunkan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main goal for now is to finish libzim7 and do a release.

If we are going take shortcuts, then I will stop being too picky. I am leaving my feedback as comments. Consider it as an approval based on your judgement.

The "Fix 'iterByTitle'" doesn't fit well into this PR.

Naming the old variant of ZIM files "normal" is not a good idea. Soon nons will be the normal version before being overridden by something else. Use a name with a more stable meaning (e.g. withns or prelibzim7).

test/archive.cpp Outdated
Comment on lines 311 to 323
for(auto& testfile: getDataFilePath("invalid.bad_mimetype_in_dirent.zim")) {
std::string expected;
if (testfile.category == "normal") {
expected = "Entry M/Language has invalid MIME-type value 1234.\n";
} else {
expected = "Entry M/Scraper has invalid MIME-type value 1234.\n";
}
EXPECT_BROKEN_ZIMFILE(testfile.path, expected)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to generate invalid.bad_mimetype_in_dirent.zim in such a way that the error message is the same across all of its variants.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partly agree.
It would be better on the test side yes.
But then we would have a evolution in the script generating the test data without a easy way to know which version to use.

We should handle this issue correctly when we will design a correct testing suite.

test/archive.cpp Outdated Show resolved Hide resolved
test/archive.cpp Outdated Show resolved Hide resolved
test/archive.cpp Outdated Show resolved Hide resolved
test/find.cpp Outdated Show resolved Hide resolved
auto count = 0;
for(auto& entry: range0) {
count++;
ASSERT_EQ(entry.getPath().find("Першая_старонка.html"), 0);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't ASSERT_EQ(entry.getPath(), "Першая_старонка.html") a better check here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

archive.findByPath returns a range of entries starting by the given path.
So the "real" semantic check is that entry.getPath start by the given path, even if there is only one result.
I've used the same check that ByPath test.

test/tools.cpp Outdated Show resolved Hide resolved
@mgautierfr
Copy link
Collaborator Author

The "Fix 'iterByTitle'" doesn't fit well into this PR.

Why ?
I've move the fix in this PR as the "new" tests fails without the fix.
(We are iterating on all the entry by titles in checkEquivalence method and iterByTitle were generating wrong range for new zim files.)


@veloman-yunkan Do you still prefer to add the test data in this repository or it is ok for you that add the new test file in repository zim-testing-suite and merge this PR with #531.
@legoktm Would it be possible to add the test zim archive as "extra" source for deb packages ?

@veloman-yunkan
Copy link
Collaborator

@veloman-yunkan Do you still prefer to add the test data in this repository or it is ok for you that add the new test file in repository zim-testing-suite and merge this PR with #531.

Since we will have to share test ZIM archives between multiple kiwix-projects we will have to use a separate repository for them eventually, so it doesn't make sense to defer that. But there is one thing that must be done before that. If I have for whatever reason to travel back in time and work with an old version of source code, I want to know which version of test data that project used at that time (in other words zim-testing-suite must not differ in that regard from other dependencies). This is easiest achieved by making zim-testing-suite a submodule of every project depending on it.

@mgautierfr mgautierfr force-pushed the better_tests_data branch 2 times, most recently from 187ea4e to 95c4aac Compare April 28, 2021 12:38
@mgautierfr mgautierfr changed the base branch from master to no_test_data2 April 28, 2021 12:55
Base automatically changed from no_test_data2 to master April 28, 2021 12:56
We may have several "same" (variant) zim files to test.
Let's `getDataFilePath` return a list of path of those "same" files and
test each paths with the same test.
…ata.

Instead of copying those files in the build dir, directly use them in
the source dir.
- Now have subdirectory for each "variant".
- Two variants : withns and nons.
If we have a specific title index, we must iterate on it.
We must not iterate on a (wrong) subset of the entries.

Related to #514
This allow tests to have information about the "kind" of the test file.
This is not used for now but will be in next commit.

It is also possible to force a particular category for the test.
(Not used neither here).
New zim files have totally different indexes.
The indexes are changed between test zim files.
Indexes are totally internal to zim files and don't have to be stable.
The "Main Page" entry is a redirection pointing to the real
main page.
But now, the redirection itself is not searchable, so we cannot
search for "Main Page". We have to use the real title of the target item.

On wikibooks_be_all_nopic_2017-02, the new zim file now have a
titleIndex so we run the "search test" in `checkEquivalence`.
But wikibooks_be_all_nopic_2017-02 contains two items with the title of
the mainPage :
- The (real) entry itself.
- A redirection pointing to the real entry.

So, we need to also resolve the redirection of the search result to check
that path correspond.
The paths are different by definition.
Let's create another test.
Some error message are slightly different between normal and nons test
zim files.
@mgautierfr mgautierfr merged commit ce8dd1a into master Apr 28, 2021
@mgautierfr mgautierfr deleted the better_tests_data branch April 28, 2021 13:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants