Fix rare ExHentai duplicated metadata bug #3033

Merged
merged 3 commits into from
Oct 13, 2022
pink-red
Contributor

@pink-red pink-red commented Oct 10, 2022

I'm using gallery-dl as a library and I've created a custom job for metadata extraction, similar to DataJob. But I was getting identical metadata for different pages on ExHentai. Digging a bit deeper, here's what I've found.

ExhentaiGalleryExtractor.items yields the same object for each image in a gallery. This causes problems when the object is not immediately used.

To reproduce:

gallery-dl --dump-json https://e-hentai.org/g/2346908/ab876a4073/ -o output.private=true

-o output.private=true causes image kwdict to be passed through util.identity in DataJob.handle_url:

self.filter = util.identity if private else util.filter_dict

def handle_url(self, url, kwdict):
    self.data.append((Message.Url, url, self.filter(kwdict)))

Unlike util.filter_dict, util.identity does not create a new dictionary as a side effect. Since DataJob also stores the objects instead of using them immediately, every stored entry ends up referencing the same dict, which triggers the bug.
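The aliasing described above can be reproduced without gallery-dl at all. The following is a minimal sketch: `items`, `identity`, `filter_dict` and `collect` are hypothetical stand-ins written for this illustration, not the library's actual code, and the filter shown is only assumed to behave like util.filter_dict in that it builds a new dict.

```python
def items():
    """Hypothetical extractor: reuses ONE dict for every yielded image."""
    kwdict = {"gallery_id": 2346908}
    for num in range(1, 4):
        kwdict["num"] = num          # mutate the shared dict in place
        yield "url-%d" % num, kwdict


def identity(obj):
    """Stand-in for an identity filter: returns the very same object."""
    return obj


def filter_dict(d):
    """Stand-in for a filtering function; note that it builds a NEW dict."""
    return {k: v for k, v in d.items() if not k.startswith("_")}


def collect(filter_func):
    """Store everything first, read it later (as a data-collecting job would)."""
    stored = [(url, filter_func(kwdict)) for url, kwdict in items()]
    return [kwdict["num"] for _, kwdict in stored]


print(collect(identity))     # every entry aliases the same dict
print(collect(filter_dict))  # each entry is an independent snapshot
```

With the identity filter all stored entries show the last page number, because they are one and the same object. With the dict-building filter each entry keeps the value it had when it was yielded.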

@mikf
Owner

mikf commented Oct 10, 2022

It is not just the exhentai extractor that reuses the same dict over and over during its data extraction process. This is a general pattern across all extractor modules, so "fixing" only exhentai does not make much sense.

This whole mess will eventually be fixed: every file returned by an extractor will be its own independent object, together with a change in how extractors return their results. But that will most likely be a lot of work and is planned for v2.0.

In your case, I'd just replace util.identity with dict.copy for DataJob's filter. Does essentially the same and returns a copy.
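As a rough illustration of that suggestion, the sketch below uses `dict.copy` as the filter function. `items` is a hypothetical stand-in for an extractor that reuses one dict, and `snapshots` is a name made up for this example. Nothing here is gallery-dl's actual code.

```python
def items():
    """Hypothetical extractor: yields the same dict object every time."""
    kwdict = {"gallery_id": 2346908}
    for num in range(1, 4):
        kwdict["num"] = num
        yield kwdict


# dict.copy can be called as a plain function on any dict instance,
# so it can be dropped in wherever an identity function was used.
filter_func = dict.copy

snapshots = [filter_func(kwdict) for kwdict in items()]

print([s["num"] for s in snapshots])    # each copy keeps its own page number
print(len({id(s) for s in snapshots}))  # number of distinct dict objects
```

Note that `dict.copy` makes a shallow copy: top-level keys are independent, but nested lists or dicts inside the metadata would still be shared between copies. If an extractor mutates such nested values in place, `copy.deepcopy` would be the safer choice, at some cost in speed.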

@mikf mikf merged commit 88f8975 into mikf:master Oct 13, 2022
@pink-red
Contributor Author

> In your case, I'd just replace util.identity with dict.copy for DataJob's filter. Does essentially the same and returns a copy.

Ah, indeed, that would be much simpler. Thank you!

@pink-red pink-red deleted the patch-1 branch October 14, 2022 14:22