[PRIO] Tags Metadata truncated #139

Closed
rgaudin opened this issue May 5, 2022 · 6 comments
Labels
bug Something isn't working

@rgaudin
Member

rgaudin commented May 5, 2022

Here's an example with this small ZIM: gutenberg_he_all_2022-04.zim

In [1]: from libzim.reader import Archive

In [2]: zim = Archive("/Users/reg/Downloads/gutenberg_he_all_2022-04.zim")

In [3]: zim.get_metadata("Tags")
Out[3]: b'_category:gutenberg;gutenberg;_ftindex:yes;_ftindex'

Now, using libkiwix:

kiwix-manage xxx.xml add gutenberg_he_all_2022-04.zim

And here's the (formatted, favicon removed) output

<library version="20110515">
  <book id="f4798cdd-47c2-b247-6493-d0e2f656fdc4"
    path="gutenberg_he_all_2022-04.zim"
    title="Project Gutenberg Library (he)"
    description="The first producer of free ebooks"
    language="heb"
    creator="gutenberg.org"
    publisher="Kiwix"
    name="gutenberg_he_all_2022-04"
    tags="_category:gutenberg;gutenberg;_ftindex:yes;_ftindex:yes;_pictures:yes;_videos:yes;_details:yes"
    faviconMimeType="image/png"
    favicon="[snip]"
    date="2022-04-21"
    articleCount="22"
    mediaCount="41"
    size="4705" />
</library>

You'll notice that _ftindex:yes is repeated, but AFAIK libzim doesn't care about the content of metadata…

@mgautierfr please take a look; this is accidentally blocking a lot of stuff on my side.
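
To make the discrepancy explicit, here is a minimal sketch that compares the two tag strings quoted above. It uses only the Python standard library; `tag_diff` is a hypothetical helper written for illustration and is not part of libzim or libkiwix.

```python
from collections import Counter

# Tag strings copied from the libzim output and the kiwix-manage XML above.
RAW_TAGS = "_category:gutenberg;gutenberg;_ftindex:yes;_ftindex"
LIBRARY_TAGS = (
    "_category:gutenberg;gutenberg;_ftindex:yes;_ftindex:yes;"
    "_pictures:yes;_videos:yes;_details:yes"
)


def tag_diff(raw, processed):
    """Return (tags only in raw, tags only in processed), counting duplicates."""
    raw_counts = Counter(t for t in raw.split(";") if t)
    processed_counts = Counter(t for t in processed.split(";") if t)
    only_in_raw = list((raw_counts - processed_counts).elements())
    only_in_processed = list((processed_counts - raw_counts).elements())
    return only_in_raw, only_in_processed


if __name__ == "__main__":
    missing, added = tag_diff(RAW_TAGS, LIBRARY_TAGS)
    print("only in ZIM metadata:", missing)
    print("only in library XML: ", added)
```

Counting duplicates (rather than using plain sets) is what lets the second `_ftindex:yes` show up in the comparison.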

@rgaudin rgaudin added the bug Something isn't working label May 5, 2022
@kelson42 kelson42 added this to the 1.1.0 milestone May 5, 2022
@kelson42
Contributor

kelson42 commented May 7, 2022

@mgautierfr It's really important to ship the next release, 1.1.0, soon with a fix for this. The stability and performance of library.kiwix.org depend on it.

@mgautierfr
Collaborator

I'm not sure what you're complaining about exactly:

  • libzim shows what is in the ZIM file, nothing else. It doesn't interpret the Tags metadata to produce something normalized; that is the job of libkiwix.
  • The duplicated _ftindex:yes is arguably a bug (in https://github.com/kiwix/libkiwix/blob/master/src/tools/otherTools.cpp#L219-L254), although the tag was already duplicated in the ZIM file itself. It should not be a problem as long as the two _ftindex:* values are consistent.
  • libkiwix adds _pictures:yes and _videos:yes when the ZIM file carries no such information, as those tags are effectively mandatory and this was the default for old ZIM files (we introduced nopic/novid/... to express that there were no images/videos/...).

How is it a blocker for you?
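
The behaviour described in the bullets above can be sketched in Python as follows. This is an illustration inferred from the output shown in this thread, not the libkiwix implementation: `normalize_tags` is a hypothetical name, the `nodet` legacy tag, the `:no` values and the `_ftindex:no` default are assumptions, and the optional `dedupe` flag is an addition that libkiwix did not apply in the output above.

```python
# Assumed mapping of legacy/short tags to their "_name:value" form.
# Only nopic/novid and the bare _ftindex tag appear in the thread;
# "nodet" and the ":no" values are assumptions for this sketch.
LEGACY_TAGS = {
    "nopic": "_pictures:no",
    "novid": "_videos:no",
    "nodet": "_details:no",
    "_ftindex": "_ftindex:yes",
}

# Defaults appended when no tag with the given prefix is present.
# The "_ftindex:no" default is an assumption.
DEFAULT_TAGS = [
    ("_ftindex:", "_ftindex:no"),
    ("_pictures:", "_pictures:yes"),
    ("_videos:", "_videos:yes"),
    ("_details:", "_details:yes"),
]


def normalize_tags(raw, dedupe=False):
    """Rewrite legacy tags and append missing defaults to a ';'-separated string."""
    tags = [LEGACY_TAGS.get(tag, tag) for tag in raw.split(";") if tag]
    for prefix, default in DEFAULT_TAGS:
        if not any(tag.startswith(prefix) for tag in tags):
            tags.append(default)
    if dedupe:
        # dict.fromkeys keeps the first occurrence and preserves order.
        tags = list(dict.fromkeys(tags))
    return ";".join(tags)


if __name__ == "__main__":
    raw = "_category:gutenberg;gutenberg;_ftindex:yes;_ftindex"
    print(normalize_tags(raw))
    print(normalize_tags(raw, dedupe=True))
```

Under these assumptions, the raw Tags value from the ZIM above yields the same string as the `tags` attribute in the kiwix-manage output, including the repeated `_ftindex:yes`; passing `dedupe=True` drops the repeat.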

@rgaudin
Member Author

rgaudin commented May 9, 2022

Ah! I didn't know that libkiwix added those tags. This solves this frightening mystery.

How is it a blocker? The central XML library used to be generated using kiwix-manage. It is now generated by a pylibzim-based script but we had a lot of different entries for the same content.

I imagine some readers may use those tags so I'll port that feature to the script (in scraperlib I suppose).

Thanks for the answer; we knew it would be something obvious, but I didn't expect this 😉

@rgaudin rgaudin closed this as completed May 9, 2022
@kelson42
Contributor

kelson42 commented May 9, 2022

zimdump would be a better tool for inspecting a ZIM.

@mgautierfr
Collaborator

BTW, you probably have a bug in the creator/scraper, as it doesn't put the right Tags in the ZIM file.

@rgaudin
Member Author

rgaudin commented May 9, 2022

Yes, I believe most non-mwoffliner scrapers don't specify all of those. I'll check all of them. We usually don't have flavours/filters but the ftindex tag might be missing.

3 participants