[PRIO] Tags Metadata truncated #139

rgaudin · 2022-05-05T17:46:09Z

Here's an example with this small ZIM: gutenberg_he_all_2022-04.zim

In [1]: from libzim.reader import Archive

In [2]: zim = Archive("/Users/reg/Downloads/gutenberg_he_all_2022-04.zim")

In [3]: zim.get_metadata("Tags")
Out[3]: b'_category:gutenberg;gutenberg;_ftindex:yes;_ftindex'

Now, using libkiwix:

kiwix-manage xxx.xml add gutenberg_he_all_2022-04.zim

And here's the (formatted, favicon removed) output:

<library version="20110515">
  <book id="f4798cdd-47c2-b247-6493-d0e2f656fdc4"
    path="gutenberg_he_all_2022-04.zim"
    title="Project Gutenberg Library (he)"
    description="The first producer of free ebooks"
    language="heb"
    creator="gutenberg.org"
    publisher="Kiwix"
    name="gutenberg_he_all_2022-04"
    tags="_category:gutenberg;gutenberg;_ftindex:yes;_ftindex:yes;_pictures:yes;_videos:yes;_details:yes"
    faviconMimeType="image/png"
    favicon="[snip]"
    date="2022-04-21"
    articleCount="22"
    mediaCount="41"
    size="4705" />
</library>

You'll notice that _ftindex:yes is repeated but AFAIK libzim doesn't care about the content of metadata…

@mgautierfr please take a look; this is accidentally blocking a lot of stuff on my side.

kelson42 · 2022-05-07T20:31:50Z

@mgautierfr It is really important to release the next 1.1.0 version soon with a fix for this. The stability and performance of library.kiwix.org depend on it.

mgautierfr · 2022-05-09T09:19:39Z

I'm not sure what you're complaining about exactly:

libzim shows what is in the ZIM file, nothing else. It doesn't interpret the Tags metadata to normalize it; that is the job of libkiwix.
The duplicated _ftindex:yes is arguably a bug (in https://github.com/kiwix/libkiwix/blob/master/src/tools/otherTools.cpp#L219-L254) (but the tag was already duplicated in the ZIM file itself). It should not be a problem, though, as long as the two _ftindex:* values are consistent.
libkiwix adds _pictures:yes, _videos:yes if the ZIM file carries no information about them, as those tags are more or less mandatory and this was the default for old ZIM files (we introduced nopic/novid/... to express that there was no image/video/...).
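
The consistency condition on repeated tags such as _ftindex:* can be checked mechanically. The snippet below is only a rough sketch, not libkiwix code: the function name `conflicting_tags` is hypothetical, and it assumes tags are separated by `;` with an optional `key:value` form, as in the output shown earlier in this thread.

```python
from collections import defaultdict


def conflicting_tags(raw):
    """Return {key: set_of_values} for tag keys that carry several distinct values.

    Hypothetical helper; assumes ';'-separated tags, optionally 'key:value'.
    """
    values = defaultdict(set)
    for tag in (part.strip() for part in raw.split(";")):
        if not tag:
            continue  # skip empty segments such as a trailing ';'
        key, sep, value = tag.partition(":")
        if sep:  # bare tags like 'gutenberg' have no value and cannot conflict
            values[key].add(value)
    return {key: vals for key, vals in values.items() if len(vals) > 1}
```

A tag repeated with the same value yields an empty result; only a pair like `_ftindex:yes;_ftindex:no` would be reported.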

How is it a blocker for you?

rgaudin · 2022-05-09T09:27:06Z

Ah! I didn't know that libkiwix added those tags. This solves this frightening mystery.

How is it a blocker? The central XML library used to be generated using kiwix-manage. It is now generated by a pylibzim-based script but we had a lot of different entries for the same content.

I imagine some readers may use those tags so I'll port that feature to the script (in scraperlib I suppose).
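
For reference, a minimal sketch of what such a port could look like. It is not the libkiwix implementation: it only reproduces the transformation visible in this thread (a bare `_ftindex` becomes `_ftindex:yes`, and `_pictures:yes`, `_videos:yes`, `_details:yes` are appended when absent). The function name `normalize_tags` is hypothetical, and the mapping of `nopic`/`novid` to `_pictures:no`/`_videos:no` is an assumption on my part.

```python
def normalize_tags(raw):
    """Approximate the tag normalization seen in the kiwix-manage output above."""
    # Defaults appended when the ZIM file says nothing about the topic.
    defaults = {"_pictures": "yes", "_videos": "yes", "_details": "yes"}
    # ASSUMPTION: legacy negative tags map to the explicit ':no' form.
    legacy = {"nopic": "_pictures:no", "novid": "_videos:no"}

    result = []
    seen = set()
    for tag in (part.strip() for part in raw.split(";")):
        if not tag:
            continue  # ignore empty segments
        if tag == "_ftindex":
            tag = "_ftindex:yes"  # bare form rewritten, as in the output above
        tag = legacy.get(tag, tag)
        seen.add(tag.split(":", 1)[0])
        result.append(tag)  # duplicates are kept on purpose, like kiwix-manage

    for key, value in defaults.items():
        if key not in seen:
            result.append(f"{key}:{value}")
    return ";".join(result)
```

Fed with the raw value returned by `zim.get_metadata("Tags")` above (decoded to `str`), this yields the same `tags` attribute as the kiwix-manage XML, duplicated `_ftindex:yes` included.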

Thanks for the answer; we knew it would be something obvious but I didn't expect this 😉

kelson42 · 2022-05-09T09:29:42Z

zimdump is the better tool for inspecting a ZIM.

mgautierfr · 2022-05-09T09:45:17Z

BTW, you probably have a bug in the creator/scraper, as it doesn't put the right Tags in the ZIM file.

rgaudin · 2022-05-09T09:51:33Z

Yes, I believe most non-mwoffliner scrapers don't specify all of those. I'll check all of them. We usually don't have flavours/filters but the ftindex tag might be missing.

rgaudin added the bug Something isn't working label May 5, 2022

rgaudin assigned mgautierfr May 5, 2022

kelson42 added this to the 1.1.0 milestone May 5, 2022

rgaudin mentioned this issue May 6, 2022

[PRIO] library cannot be refreshed kiwix/operations#22

Closed

rgaudin closed this as completed May 9, 2022

mgautierfr mentioned this issue May 9, 2022

Drop libzim wrapper kiwix/libkiwix#430

Closed
