deduplicating #6099

Closed
yggdrasil75 opened this issue Aug 28, 2024 · 7 comments

Comments

@yggdrasil75

I am trying to download images from various boorus to train a LoRA, and I want to make sure I don't get duplicates. I am using "filename": "{md5}.{extension}", but I am getting "none" instead of an MD5. I tried checksummd5 as well, but still get "none". How do I prevent duplicates across multiple boorus using a consistent format without getting "none"?

@Hrxn
Contributor

Hrxn commented Aug 28, 2024

{md5} should be right, given that these values are provided by the booru. You can always check with gallery-dl -K YourURL though.

"none" means there's probably something wrong with your config here.

Assuming that all your targeted boorus actually provide MD5 sums, and that they are actually correct between different boorus, you should try using settings like these for the booru sites:

            "archive-prefix": "",
            "archive-format": "{md5|hash}",
            "archive": "~/gallery-dl/archives/single_archive_all_boorus.db",

That is, use a single archive file for all targeted boorus with the same archive format setting everywhere. The format above tries md5 first and falls back to hash if md5 is not available, in case different boorus use different metadata names for this value.

The other, maybe even simpler and more robust solution (but hey, if this is just for training data, do some misses really matter?):

First download, and then deduplicate on your drive.
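
A minimal sketch of the "deduplicate on your drive" step, assuming plain Python with only the standard library. The function names (`file_md5`, `find_duplicates`, `remove_duplicates`) are made up for this example and are not part of gallery-dl. It only catches byte-identical files; a re-encoded or resized copy of the same image will hash differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under `root` by content hash; return only groups with >1 file."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_md5(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}


def remove_duplicates(root, dry_run=True):
    """Keep the first file of each duplicate group; delete (or just list) the rest."""
    removed = []
    for paths in find_duplicates(root).values():
        for extra in paths[1:]:
            if not dry_run:
                extra.unlink()
            removed.append(extra)
    return removed
```

Calling `remove_duplicates("downloads")` with the default `dry_run=True` only reports what would be deleted, which is probably the safer first run.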

@yggdrasil75
Author

Deduplicating on my own may be what I go with. I am also trying to merge the resulting tags, so that if an image has only 5 tags on one booru and 50 on another, I get 55 tags instead of just 5 (or just 50, whichever is downloaded last, I guess), unless there is a way to have the metadata field appended to rather than replaced. I tried md5|hash in the archive prefix field and didn't see a difference.
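
The tag merge described here could be done after downloading. A rough sketch, assuming each downloaded item's metadata has already been loaded into a dict: the keys `"md5"` and `"tags"` are hypothetical placeholders, since the actual field names differ between boorus. Note that a union counts shared tags once, so 5 and 50 tags yield at most 55.

```python
def merge_tags(records):
    """Union the tags of all records that share the same MD5.

    Each record is a dict with hypothetical keys "md5" and "tags"; "tags" may
    be a list or a whitespace-separated string.
    """
    merged = {}
    for record in records:
        tags = record["tags"]
        if isinstance(tags, str):
            tags = tags.split()
        merged.setdefault(record["md5"], set()).update(tags)
    # Sort for stable, reproducible output.
    return {md5: sorted(tags) for md5, tags in merged.items()}


records = [
    {"md5": "abc", "tags": "1girl solo smile"},
    {"md5": "abc", "tags": ["solo", "outdoors", "hat"]},
    {"md5": "def", "tags": "landscape"},
]
print(merge_tags(records))
```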

@Hrxn
Contributor

Hrxn commented Aug 28, 2024

Whoops, sorry, I've made a mistake in my comment above (now corrected).
The value stored in the archive is set by "archive-format", while "archive-prefix" needs to be empty for the all-sites-in-a-single-archive approach to work.

@mikf
Owner

mikf commented Aug 29, 2024

> I am using "filename": "{md5}.{extension}" but I am getting "none"

Nearly all *booru sites provide an {md5} value. The one exception I can think of is e621, where this value can be accessed as {file[md5]} or {filename}.

{md5|file[md5]|filename}
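
To illustrate the idea behind the `|` alternatives, here is a toy model in Python. It is not gallery-dl's implementation; it assumes the first alternative with a non-empty value wins, and the helper names `resolve` and `first_available` and the sample metadata dicts are invented for this example.

```python
import re


def resolve(field, metadata):
    """Look up "name" or "name[key]" in a metadata dict; None if missing."""
    match = re.fullmatch(r"(\w+)(?:\[(\w+)\])?", field)
    if match is None:
        return None
    name, key = match.groups()
    value = metadata.get(name)
    if key is not None:
        value = value.get(key) if isinstance(value, dict) else None
    return value


def first_available(spec, metadata):
    """Return the first non-empty value among "a|b|c" style alternatives."""
    for field in spec.split("|"):
        value = resolve(field, metadata)
        if value:
            return value
    return None


# Made-up metadata shapes: one site exposes "md5" directly, one nests it.
flat_site = {"md5": "aaa", "filename": "aaa"}
nested_site = {"file": {"md5": "bbb"}, "filename": "bbb"}

spec = "md5|file[md5]|filename"
print(first_available(spec, flat_site), first_available(spec, nested_site))
```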

@yggdrasil75
Author

Is there a way to manually generate a hash upon user request? Whether that hashes the image or the file, it might allow more consistency. I know it will slow things down if you have good download speed but a slow CPU, but hashing the image content without the metadata might let me find the same image automatically, to prevent duplicates between a source site and a booru, especially if a site provides an md5 that may be incorrect for whatever reason (database corruption, bad practices, or whatever).

@a84r7a3rga76fg

> especially if a site provides an md5 that may be incorrect for whatever reason

Happens a lot on Danbooru, Gelbooru and all of the other Boorus that I've tested on.
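
One way to spot such mismatches after the fact, sketched under the assumption that files were saved with "filename": "{md5}.{extension}" so the name should equal the MD5 of the content. The function name `find_md5_mismatches` is invented for this example.

```python
import hashlib
from pathlib import Path


def find_md5_mismatches(root):
    """Return files whose name (without extension) is not the MD5 of their content.

    Assumes files were saved as "{md5}.{extension}".
    """
    mismatches = []
    for path in sorted(Path(root).iterdir()):
        if not path.is_file():
            continue
        # Reads the whole file at once; hash in chunks instead for very large files.
        actual = hashlib.md5(path.read_bytes()).hexdigest()
        if path.stem.lower() != actual:
            mismatches.append((path, actual))
    return mismatches
```

Each returned pair holds the file path and the MD5 actually computed from its bytes, so the file could be renamed or re-downloaded as needed.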

@mikf
Owner

mikf commented Sep 3, 2024

> is there a way to manually generate a hash upon user request?

ae9b0da

"postprocessors": ["hash"] or -P hash will generate MD5 and SHA1 hash digests for downloaded files, but you can also use more sophisticated options for other hashes etc.


4 participants