Add zim files in the "new" format as testing data. #535

mgautierfr · 2021-04-15T14:08:21Z

This PR somewhat invalidates #531.
(Moving the zim files to a different repository is still an open idea. It could be done after this PR, although that would be pretty useless: adding the binary files to git grows the total repository size even if we remove them right afterwards.)

The first commits prepare the test tools to run a test on several "same" test files.
Then we add the new test files (which can be replaced by the download presented in #531).
Finally we update the tests for the new test files.

It also cherry-picks #518, which is now obsolete.

codecov · 2021-04-15T14:09:13Z

Codecov Report

Merging #535 (91eab3b) into master (75fcc61) will increase coverage by 0.58%.
The diff coverage is 91.66%.

❗ Current head 91eab3b differs from pull request most recent head 6bc2dab. Consider uploading reports for the commit 6bc2dab to get more accurate results

@@            Coverage Diff             @@
##           master     #535      +/-   ##
==========================================
+ Coverage   79.30%   79.88%   +0.58%     
==========================================
  Files          91       91              
  Lines        3744     3748       +4     
  Branches     1701     1661      -40     
==========================================
+ Hits         2969     2994      +25     
+ Misses        774      753      -21     
  Partials        1        1

Impacted Files	Coverage Δ
src/writer/handler.h	`100.00% <ø> (ø)`
src/writer/counterHandler.cpp	`87.50% <87.50%> (ø)`
src/archive.cpp	`62.98% <100.00%> (+4.32%)`	⬆️
src/fileimpl.cpp	`85.62% <100.00%> (+1.57%)`	⬆️
src/fileimpl.h	`90.90% <100.00%> (+0.90%)`	⬆️
src/writer/counterHandler.h	`100.00% <100.00%> (ø)`
src/writer/creator.cpp	`83.49% <100.00%> (ø)`
src/writer/titleListingHandler.h	`100.00% <100.00%> (ø)`
src/writer/xapianHandler.h	`100.00% <100.00%> (ø)`
... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 75fcc61...6bc2dab.

mgautierfr · 2021-04-15T14:59:15Z

@legoktm This PR fails on bionic. It seems that it cannot find the gtest library. But nothing has changed on this side.
Do you know what is the problem ?

legoktm · 2021-04-15T16:29:36Z

> @legoktm This PR fails on bionic. It seems that it cannot find the gtest library. But nothing has changed on this side.
> Do you know what is the problem ?

I think that's just meson checking to see whether gtest is installed or not. The real error appears to be:

test/meson.build:29:0: ERROR: Division works only with integers.

which corresponds to the line you added:

datadir = meson.current_source_dir() / 'data'

bionic uses meson 0.45.1, which might not have that feature yet?

mgautierfr · 2021-04-15T17:00:07Z

> bionic uses meson 0.45.1, which might not have that feature yet?

Yes, it is probably that. I missed that line. Thanks.

veloman-yunkan

I don't fully understand the main objective of this PR. My impression from #531 was that it was preparing ground for validating #518. Therefore my recommendation was not to aim at updating all tests at once, but create a few unit tests focusing on #518 using this new approach for test data management. As a result of my questions asked in #531 we now have a bigger PR with a larger and not so well defined scope - it both updates a lot of unit-tests and fixes some issues with Archive::iterByTitle().

veloman-yunkan · 2021-04-15T19:43:20Z

test/tools.cpp

+  std::string dataDir;
+  setDataDir(dataDir);


This is confusing. Why not const std::string dataDir = getDataDir();?

mgautierfr · 2021-04-16

As said in the comment inside setDataDir, the gtest FAIL macro can be used only in a void function. So we have to use this surprising (I agree) signature.
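
The constraint behind this signature can be sketched without gtest. gtest's fatal-failure macros (`FAIL()`, `ASSERT_*`) expand to a `return` of a void expression, so they only compile inside functions returning void, which is why the result comes back through an output parameter. In the sketch below `SKETCH_FAIL` is a hypothetical stand-in for `FAIL()`, and the environment variable name `ZIM_TEST_DATA_DIR` is an assumption, not taken from this thread.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical stand-in for gtest's FAIL(): like the real macro it leaves the
// enclosing function with a plain `return;`, so that function must return void.
static int g_failures = 0;
#define SKETCH_FAIL(msg)        \
  do {                          \
    ++g_failures;               \
    std::cerr << (msg) << '\n'; \
    return;                     \
  } while (0)

// Output-parameter form: the function itself stays void, so the macro compiles.
// ZIM_TEST_DATA_DIR is an assumed variable name used only for illustration.
void setDataDir(std::string& dataDir) {
  const char* value = std::getenv("ZIM_TEST_DATA_DIR");
  if (value == nullptr) {
    SKETCH_FAIL("ZIM_TEST_DATA_DIR is not set");
  }
  dataDir = value;
}
```

Writing `std::string getDataDir()` with the same macro inside would not compile, because the macro's `return;` carries no value.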

mgautierfr · 2021-04-16T12:30:16Z

> I don't fully understand the main objective of this PR.

The main objective of this PR is to run all our unit tests with zim files using the new "format" (no namespace, some fixes to the metadata of the xapian db, ...).
We need to update our test data, and we must also keep the existing test data. #531 is in a way preparatory work for this PR (as I prefer not to add new test files to the git repository).

veloman-yunkan · 2021-04-16T16:12:21Z

> > I don't fully understand the main objective of this PR.
>
> The main objective of this PR is to run all our unit tests with zim files using the new "format" (no namespace, some fixes to the metadata of the xapian db, ...).
> We need to update our test data, and we must also keep the existing test data. #531 is in a way preparatory work for this PR (as I prefer not to add new test files to the git repository).

IMHO, before initiating any real changes in this direction we must have some vision. Of course exploratory/experimental changes are a way of learning important things that may influence the final vision/plan, but we should at least treat them as such in the beginning. So, to the best of my understanding you are trying to establish an approach for dealing with the need to test backward compatibility of libzim on multiple flavours of ZIM archive content. We need to realize that if ZIM archives have evolved before they will likely evolve in the future too (though we may be forced to limit the rate at which new flavours are introduced to keep the full current diversity of ZIM archive flavours manageable). The proposed approach for running the same test on multiple variants of the "same" ZIM archive must be as future-proof as possible. To this end, we better first understand what kind of diversity we will be dealing with.

Generally speaking, a ZIM archive flavour is a combination of multiple features such as:

- Compression method
- Grouping of content (namespaces)
- Presence of the front article index
- Split or single-piece
- etc

Some tests will be concerned with only a single dimension in the total space of all possible ZIM flavours. Question: do we want to handle these and the next paragraph as special cases of the same general approach?

On the other hand, not all combinations are valid (old namespace scheme would never occur with a front article index). The latter observation suggests that a linear/versionable sequence of typical ZIM content flavours (i.e. a flavour of a ZIM archive that could have been created by libzim between such and such dates) also makes some sense (similar to ZIM format versions). Some high level tests will probably need to run on all those versions of ZIM archives (or, rather, starting from a particular version where the requisite feature first appears).
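
One way to picture the flavour space described above is as a small record of independent features plus a validity rule. The sketch below is purely illustrative: the type and field names (`ZimFlavour`, `hasNamespaces`, `hasFrontArticleIndex`, `isSplit`) and the compression values listed are assumptions, and the only rule encoded is the one stated above (the old namespace scheme never occurs with a front article index).

```cpp
#include <vector>

enum class Compression { None, Lzma, Zstd };  // assumed example values

// One point in the space of ZIM flavours (hypothetical type).
struct ZimFlavour {
  Compression compression;
  bool hasNamespaces;         // old namespace scheme
  bool hasFrontArticleIndex;
  bool isSplit;
};

// Encodes the single constraint mentioned in the discussion.
bool isValidFlavour(const ZimFlavour& f) {
  return !(f.hasNamespaces && f.hasFrontArticleIndex);
}

// Enumerates every feature combination and keeps only the valid ones.
std::vector<ZimFlavour> enumerateValidFlavours() {
  std::vector<ZimFlavour> result;
  for (Compression c : {Compression::None, Compression::Lzma, Compression::Zstd}) {
    for (bool ns : {false, true}) {
      for (bool idx : {false, true}) {
        for (bool split : {false, true}) {
          const ZimFlavour f{c, ns, idx, split};
          if (isValidFlavour(f)) result.push_back(f);
        }
      }
    }
  }
  return result;
}
```

Even with four features the list grows quickly, which is the practical argument for testing a linear sequence of typical flavours rather than every combination.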

BTW, it is possible to take an approach where the test code is written to work with a single version of the given ZIM file (avoiding the boilerplate of iterating over ZIM files in each test case) - the responsibility of testing the given functionality on multiple flavours can be shifted onto a driver script that runs the same test in different environments effectively feeding different data to it. The tests can have certain metadata attached to them that will instruct the driver script which inputs to exercise the unit-test on.

mgautierfr · 2021-04-16T17:40:37Z

I totally agree with you. Except that it is not in scope for now :)
Your comment totally makes sense, and it could (should) be made in #469.
I realize we had this discussion with @kelson42 in our weekly meeting but never wrote it down anywhere, so you don't have the full picture of where we are.
This is the next big project (and why we have created https://github.com/openzim/zim-testing-suite).
We will need to answer all your questions before starting the (real) implementation.

But for now this is not in scope.
The main goal for now is to finish libzim7 and do a release. We are already several months late on this.
We have to finish the API (#530, #514, #532), finish the format (#527) and test the new code with old and new zim formats.
This PR just tests the code with the old and new formats. It is a quick and dirty path. I totally agree this is not the right approach, but we don't really have the time to design a proper testing system and add more weeks of delay.

That is why I don't want to add more binary files to this repository: I'm pretty sure this is not the right solution and the test files will be dropped. And I don't want to grow the git history with files that will be removed in the near future but that users will still have to download every time.
I've taken the path with the least definitive impact on the future:
just put the test data somewhere else, finish the release of libzim7, and then think about the testing suite to have a proper system.

veloman-yunkan

> The main goal for now is to finish libzim7 and do a release.

If we are going to take shortcuts, then I will stop being too picky. I am leaving my feedback as comments. Consider it as an approval based on your judgement.

The "Fix 'iterByTitle'" doesn't fit well into this PR.

Naming the old variant of ZIM files "normal" is not a good idea. Soon nons will be the normal version before being overridden by something else. Use a name with a more stable meaning (e.g. withns or prelibzim7).

veloman-yunkan · 2021-04-17T15:19:03Z

test/archive.cpp

+  for(auto& testfile: getDataFilePath("invalid.bad_mimetype_in_dirent.zim")) {
+    std::string expected;
+    if (testfile.category == "normal") {
+      expected = "Entry M/Language has invalid MIME-type value 1234.\n";
+    } else {
+      expected = "Entry M/Scraper has invalid MIME-type value 1234.\n";
+    }
+    EXPECT_BROKEN_ZIMFILE(testfile.path, expected)
+  }
 }


It would be better to generate invalid.bad_mimetype_in_dirent.zim in such a way that the error message is the same across all of its variants.

mgautierfr · 2021-04-19

Partly agree.
It would be better on the test side, yes.
But then the script generating the test data would evolve, without an easy way to know which version of it to use.

We should handle this issue properly when we design a proper testing suite.

veloman-yunkan · 2021-04-17T15:40:23Z

test/find.cpp

+    auto count = 0;
+    for(auto& entry: range0) {
+      count++;
+      ASSERT_EQ(entry.getPath().find("Першая_старонка.html"), 0);


Isn't ASSERT_EQ(entry.getPath(), "Першая_старонка.html") a better check here?

mgautierfr · 2021-04-19

archive.findByPath returns a range of entries whose paths start with the given path.
So the "real" semantic check is that entry.getPath() starts with the given path, even if there is only one result.
I've used the same check as the ByPath test.
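
The difference between the two checks can be shown on plain strings. This is a minimal sketch with no libzim dependency; `startsWith` is a hypothetical helper and the paths in the usage note are made up. `find(prefix) == 0` is true exactly when the string begins with `prefix`, which matches the range semantics described above, whereas equality would reject longer paths sharing the prefix.

```cpp
#include <string>

// True when `path` begins with `prefix`. Gives the same answer as
// path.find(prefix) == 0, but stops at the first mismatch instead of
// searching the rest of the string.
bool startsWith(const std::string& path, const std::string& prefix) {
  return path.compare(0, prefix.size(), prefix) == 0;
}
```

For example, `startsWith("main_page.html", "main_page")` holds while `"main_page.html" == "main_page"` does not, so a prefix assertion stays valid if the range ever contains more than one entry.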

mgautierfr · 2021-04-19T12:50:32Z

> The "Fix 'iterByTitle'" doesn't fit well into this PR.

Why?
I've moved the fix into this PR because the "new" tests fail without it.
(We iterate over all the entries by title in the checkEquivalence method, and iterByTitle was generating a wrong range for new zim files.)

@veloman-yunkan Do you still prefer to add the test data to this repository, or is it OK for you that we add the new test files to the zim-testing-suite repository and merge this PR with #531?
@legoktm Would it be possible to add the test zim archives as an "extra" source for the deb packages?

veloman-yunkan · 2021-04-19T15:37:08Z

> @veloman-yunkan Do you still prefer to add the test data to this repository, or is it OK for you that we add the new test files to the zim-testing-suite repository and merge this PR with #531?

Since we will have to share test ZIM archives between multiple kiwix-projects we will have to use a separate repository for them eventually, so it doesn't make sense to defer that. But there is one thing that must be done before that. If I have for whatever reason to travel back in time and work with an old version of source code, I want to know which version of test data that project used at that time (in other words zim-testing-suite must not differ in that regard from other dependencies). This is easiest achieved by making zim-testing-suite a submodule of every project depending on it.

The "Main Page" entry is a redirection pointing to the real main page. But now, the redirection itself is not searchable, so we cannot search for "Main Page". We have to use the real title of the target item.

On wikibooks_be_all_nopic_2017-02, the new zim file now have a titleIndex so we run the "search test" in `checkEquivalence`. But wikibooks_be_all_nopic_2017-02 contains two items with the title of the mainPage :
- The (real) entry itself.
- A redirection pointing to the real entry.

So, we need to also resolve the redirection of the search result to check that path correspond.

mgautierfr force-pushed the better_tests_data branch 4 times, most recently from e070817 to f5454c6 April 15, 2021 14:45

mgautierfr requested a review from veloman-yunkan April 15, 2021 14:59

This was referenced Apr 15, 2021

Use the zim files in zim-testing-suite for unit tests. #531

Closed

Fix iterByTitle. #518

Closed

veloman-yunkan reviewed Apr 15, 2021

veloman-yunkan reviewed Apr 17, 2021

mgautierfr force-pushed the better_tests_data branch from 968ce9b to a6fee67 April 19, 2021 12:51

This was referenced Apr 20, 2021

Remove test data from repository and use data from zim-testing-suite. #538

Merged

Counter metadata #539

Merged

mgautierfr force-pushed the better_tests_data branch from 91eab3b to a6fee67 April 21, 2021 09:05

mgautierfr force-pushed the better_tests_data branch 2 times, most recently from 187ea4e to 95c4aac April 28, 2021 12:38

mgautierfr changed the base branch from master to no_test_data2 April 28, 2021 12:55

Base automatically changed from no_test_data2 to master April 28, 2021 12:56

mgautierfr added 4 commits April 28, 2021 15:02

[TEST] Use a helper function to get the path of a test zim file.

47abc5f

Update getDataFilePath to return a list of path.

4fe2e25

We may have several "same" (variant) zim files to test. Let's `getDataFilePath` return a list of path of those "same" files and test each paths with the same test.

Use a environment variable to specify the directory containing test data.

e504aa2

Instead of copying those files in the build dir, directly use them in the source dir.

Use the version 0.2 of the test data.

4dce48c

- Now have subdirectory for each "variant". - Two variants : withns and nons.

mgautierfr added 10 commits April 28, 2021 15:02

Adapt test tools to search for zim files in a subdirectory.

687c77a

Fix iterByTitle.

4a58bdb

If we have a specific title index, we must iterate on it. We must not iterate on a (wrong) subset of the entries. Related to #514

Update getDataFilePath to return information about the test file.

d00ff0a

This allow tests to have information about the "kind" of the test file. This is not used for now but will be in next commit. It is also possible to force a particular category for the test. (Not used neither here).
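
The helper described in these commits could look roughly like the sketch below. Only the `path` and `category` members and the `withns`/`nons` variant names come from this PR; the explicit `dataDir` parameter, the default argument and the absence of any file-existence check are simplifying assumptions, so the real helper in test/tools.cpp may differ.

```cpp
#include <string>
#include <vector>

struct TestFile {
  std::string path;      // full path of one variant of the test file
  std::string category;  // variant name: "withns" or "nons"
};

// Returns one TestFile per variant subdirectory of `dataDir`.
// A non-empty `category` forces a single variant (assumed behaviour).
std::vector<TestFile> getDataFilePath(const std::string& dataDir,
                                      const std::string& filename,
                                      const std::string& category = "") {
  const std::vector<std::string> variants = {"withns", "nons"};
  std::vector<TestFile> files;
  for (const auto& variant : variants) {
    if (!category.empty() && category != variant) continue;
    files.push_back({dataDir + "/" + variant + "/" + filename, variant});
  }
  return files;
}
```

A test then loops over the returned list and branches on `testfile.category` only where the variants legitimately differ, as the ZimArchive::validate test does for its error messages.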

Test iterator only on "withns" zim files.

a84403a

New zim files have totally different indexes.

Do not check index in FindTests::ByTitle.

5e9fd9a

The indexes are changed between test zim files. Indexes are totally internal to zim files and don't have to be stable.

Fix FindTests::ByPath.

51447df

The paths are different by definition. Let's create another test.

Fix ZimArchive::validate.

6c84b91

Some error message are slightly different between normal and nons test zim files.

Fix (workaround) discovering of test zim file on windows.

f65d19d

Remove unused test.

6bc2dab

mgautierfr force-pushed the better_tests_data branch from 5f1f261 to 6bc2dab April 28, 2021 13:02

mgautierfr merged commit ce8dd1a into master Apr 28, 2021

mgautierfr deleted the better_tests_data branch April 28, 2021 13:45
