Fix semantics about number of entry/article/front article/...( #514

Closed
mgautierfr opened this issue Mar 4, 2021 · 3 comments · Fixed by #576

@mgautierfr
Collaborator

mgautierfr commented Mar 4, 2021

With the current zim library, there is no real distinction between the kinds of articles in a zim file:

  • File::getCountArticles() returns the number of articles (all entries) in the zim file.
  • File::getNamespaceCount(char ns) returns the number of articles in the specified namespace.

On top of that, kiwix-lib has:

  • Reader::getGlobalCount(), which returns File::getCountArticles().
  • Reader::getArticleCount(), which parses the M/Counter metadata to return the number of html articles (if M/Counter is not available, it returns the number of articles in the A namespace).
  • Reader::getMediaCount(), which parses the M/Counter metadata to return the number of image/video articles.

With the "new" zim library (in master, to be released), it is a bit more complex.
For now we only have Archive::getEntryCount(), which returns the number of user entries (on old zim files, that is all entries; on new zim files, it is the entries in the C namespace).

We also have three ways to iterate on entries:

  • iterByPath, which iterates on all user entries.
  • iterByTitle, which iterates on all user entries.
  • iterEfficient, which iterates on ALL entries.

And on top of that, kiwix-lib has:

  • Reader::getGlobalCount(), which returns Archive::getEntryCount().
  • Reader::getArticleCount(), which parses the M/Counter metadata to return the number of html articles (no fallback if M/Counter is not available).
  • Reader::getMediaCount(), which parses the M/Counter metadata to return the number of image/video articles.

Parsing the M/Counter metadata is still possible, but we have no specific API for that.
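
To make the M/Counter parsing mentioned above concrete, here is a minimal sketch. It assumes the Counter value is a semicolon-separated list of mimetype=count pairs (for example "text/html=10;image/png=4"). The names parseCounter, articleCount and mediaCount are hypothetical helpers, not libzim or kiwix-lib functions, and the sketch does not handle mimetypes that themselves contain a ';'.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <sstream>
#include <string>

// Hypothetical helper (not part of libzim or kiwix-lib): parse a Counter
// value such as "text/html=10;image/png=4" into a mimetype -> count map.
// Assumption: items are separated by ';' and each item is "mimetype=count".
std::map<std::string, std::size_t> parseCounter(const std::string& counter) {
    std::map<std::string, std::size_t> counts;
    std::istringstream stream(counter);
    std::string item;
    while (std::getline(stream, item, ';')) {
        const std::size_t eq = item.rfind('=');
        if (eq == std::string::npos || eq == 0 || eq + 1 == item.size()) {
            continue;  // skip items with no '=', no mimetype or no count
        }
        try {
            counts[item.substr(0, eq)] += std::stoul(item.substr(eq + 1));
        } catch (const std::exception&) {
            // the count does not start with a number: ignore this item
        }
    }
    return counts;
}

// Number of html articles, in the sense of Reader::getArticleCount().
std::size_t articleCount(const std::map<std::string, std::size_t>& counts) {
    const auto it = counts.find("text/html");
    return it == counts.end() ? 0 : it->second;
}

// Number of image/video entries, in the sense of Reader::getMediaCount().
std::size_t mediaCount(const std::map<std::string, std::size_t>& counts) {
    std::size_t total = 0;
    for (const auto& kv : counts) {
        const std::string& mime = kv.first;
        if (mime.rfind("image/", 0) == 0 || mime.rfind("video/", 0) == 0) {
            total += kv.second;
        }
    }
    return total;
}
```

With a reader-side helper like this, articleCount(parseCounter(value)) returns 0 when the metadata is missing or empty, which matches the "no fallback" behaviour described above.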

With the recent changes, we added a specific listing in the zim file to reference "front articles" (entries to be displayed as "real" entries to the user, as opposed to "resource" entries). While those front articles are always html content for now, this is not enforced and we shouldn't assume it.
Searching/iterating by title is/should be done only on those front articles (which may/will be a subset of all "user" entries).
Random and suggestions are also based on those front articles.

On top of that, please remember that it is totally valid to have a zim file using the new namespace scheme (all user entries in the C namespace) but without a specific front article listing. We (openzim) will probably never generate such files, but we must be prepared to read them.

We need to define an API that provides coherent values, along with a definition of those values, and make the API adapt to the zim file version at hand so that it returns values consistent with their definition.

I propose (but I'm really open to any suggestion):

  • getAllEntryCount(). Returns the number of all entries (user entries or not).
  • getEntryCount(). Returns the number of user entries (all entries on old zim files, C entries in zim files using the new namespace scheme).
  • getArticleCount(). Returns the number of entries accessible through their titles.
    This is technically the number of entries in the title listing (specific or not).
    . On an old zim file (old namespace scheme, no specific listing), this is the same as getEntryCount().
    . On a zim file with the new namespace scheme but no specific listing, this will be the count of ALL entries (which will be greater than getEntryCount()).
    . On a zim file with the new namespace scheme and a specific listing, this will be the number of entries in the specific listing (less than getEntryCount()).
    What is important is that it is the number of entries you get if you iterate on the range returned by Archive::iterByTitle (which is presently buggy for zim files with a specific listing).
  • hasSpecificTitleListing(). Tells whether the zim file has a specific title listing or not.
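
The three cases above can be sketched with a small model. ZimModel and its fields are hypothetical and only stand in for what the library knows about a file; the free functions mirror the proposed method names but are not the libzim implementation. The sketch assumes old-scheme files never carry a specific listing.

```cpp
#include <cstddef>

// Hypothetical model of a zim file, for illustration only (not a libzim type).
struct ZimModel {
    bool newNamespaceScheme;    // user entries live in the C namespace
    bool hasTitleListing;       // a specific front article listing exists
    std::size_t allEntries;     // every entry, user or not
    std::size_t userEntries;    // C entries (meaningful with the new scheme)
    std::size_t listedEntries;  // entries in the specific listing, if any
};

std::size_t getAllEntryCount(const ZimModel& z) { return z.allEntries; }

// All entries on old files, C entries with the new namespace scheme.
std::size_t getEntryCount(const ZimModel& z) {
    return z.newNamespaceScheme ? z.userEntries : z.allEntries;
}

bool hasSpecificTitleListing(const ZimModel& z) { return z.hasTitleListing; }

// Size of the title listing that iterByTitle would walk.
std::size_t getArticleCount(const ZimModel& z) {
    if (hasSpecificTitleListing(z)) {
        return z.listedEntries;  // specific listing: front articles only
    }
    return z.allEntries;         // generic title listing covers ALL entries
}
```

For a model like ZimModel{true, false, 100, 90, 0} (new scheme, no listing), getArticleCount gives 100 while getEntryCount gives 90, which is the "greater than getEntryCount" case described above.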

The three ways to iterate on entries would become:

  • iterByPath, which iterates on all user entries.
  • iterByTitle, which iterates on all front articles (as said before, that could be more than all user entries).
  • iterEfficient, which iterates on ALL entries.

Or...
For old zim files, make:

  • iterByPath, iterByTitle, iterEfficient iterate on all entries (but in different orders)
  • getAllEntryCount, getEntryCount, getArticleCount return the same thing (the number of all entries)

For zim files with the new namespace scheme but no specific listing:

  • iterByPath, iterByTitle, iterEfficient iterate on all user entries (C namespace) (but in different orders)
  • getAllEntryCount returns the number of all entries.
  • getEntryCount, getArticleCount return the number of all user entries.

For zim files with the new namespace scheme and a specific listing:

  • iterByPath, iterEfficient iterate on all user entries (C namespace) (but in different orders)
  • iterByTitle iterates on front articles (listed in the specific listing)
  • getAllEntryCount returns the number of all entries.
  • getEntryCount returns the number of all user entries.
  • getArticleCount returns the number of front articles.
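
This alternative can be summarised as a per-file-kind mapping. The sketch below is only an illustration of the three cases listed in this second option: ZimKind, RawCounts, ApiCounts and alternativeCounts are hypothetical names, not libzim types or functions.

```cpp
#include <cstddef>

// Hypothetical classification of zim files (not a libzim type).
enum class ZimKind { Old, NewNoListing, NewWithListing };

// What the file actually contains.
struct RawCounts {
    std::size_t all;    // every entry
    std::size_t user;   // entries in the C namespace
    std::size_t front;  // entries in the specific listing
};

// What the three proposed getters would report.
struct ApiCounts {
    std::size_t allEntryCount;
    std::size_t entryCount;
    std::size_t articleCount;
};

ApiCounts alternativeCounts(ZimKind kind, const RawCounts& raw) {
    switch (kind) {
        case ZimKind::Old:
            return {raw.all, raw.all, raw.all};     // everything is the same
        case ZimKind::NewNoListing:
            return {raw.all, raw.user, raw.user};   // articles == user entries
        case ZimKind::NewWithListing:
            return {raw.all, raw.user, raw.front};  // articles == front articles
    }
    return {raw.all, raw.all, raw.all};  // not reached for valid enum values
}
```

The only difference from the first option is the middle case: without a specific listing, getArticleCount would report user entries rather than ALL entries.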
@rgaudin
Member

rgaudin commented Mar 8, 2021

Thanks @mgautierfr for the detailed explanation and proposition.

This sounds OK. I'm mostly interested in the last case (new namespace scheme, with listing), as this is what we are about to create now, and it sounds reasonable. I guess extending the hints later may have an impact, but that's not for the near future, so we'll see when the time comes.

mgautierfr added a commit that referenced this issue Mar 10, 2021
If we have a specific title index, we must iterate on it.
We must not iterate on a (wrong) subset of the entries.

Related to #514
@kelson42
Contributor

@mgautierfr This looks good, even if it is indeed starting to get quite complex. I have one worry: redirect handling. If I understand properly, redirects and articles are treated the same way. One usage of the counts that comes to mind is the numbers communicated in the Kiwix library (number of media, number of articles). To me it looks like what is needed there is the number of front articles but without redirects... and it seems impossible to get that number, right?

@mgautierfr
Collaborator Author

To me it looks like what is needed there is the number of front articles but without redirects... and it seems impossible to get that number, right?

From the zim file format itself, no.
But we could store that in a metadata (M/Counter could be a good candidate).
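
As a rough illustration of what "storing it in a metadata" would require: the count has to be computed while the file is created, when it is still known which entries are front articles and which are redirects. EntryModel and countFrontArticlesWithoutRedirects are hypothetical names for this sketch, not part of the libzim writer API, and how the value would be encoded in the metadata is left open.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of an entry at creation time (not a libzim type).
struct EntryModel {
    std::string path;
    bool isFrontArticle;  // would be listed in the specific title listing
    bool isRedirect;      // points to another entry instead of holding content
};

// Count the entries a library front end would advertise as "articles":
// front articles that are not redirects.
std::size_t countFrontArticlesWithoutRedirects(const std::vector<EntryModel>& entries) {
    return static_cast<std::size_t>(std::count_if(
        entries.begin(), entries.end(),
        [](const EntryModel& e) { return e.isFrontArticle && !e.isRedirect; }));
}
```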

mgautierfr added a commit that referenced this issue Mar 30, 2021
If we have a specific title index, we must iterate on it.
We must not iterate on a (wrong) subset of the entries.

Related to #514
mgautierfr added a commit that referenced this issue Apr 15, 2021
If we have a specific title index, we must iterate on it.
We must not iterate on a (wrong) subset of the entries.

Related to #514
mgautierfr added a commit that referenced this issue Apr 19, 2021
If we have a specific title index, we must iterate on it.
We must not iterate on a (wrong) subset of the entries.

Related to #514
@kelson42 kelson42 pinned this issue Apr 21, 2021
mgautierfr added a commit that referenced this issue Apr 28, 2021
If we have a specific title index, we must iterate on it.
We must not iterate on a (wrong) subset of the entries.

Related to #514
@kelson42 kelson42 unpinned this issue May 7, 2021