deduplicating #6099
Comments
Assuming that all of your targeted boorus actually provide MD5 sums, and that those sums are consistent across the different boorus, you should try using settings like these for the booru sites:
That is, use a single archive file for all targeted boorus, and use the same archive format setting for each of them, here first trying
The other, perhaps simpler and more robust, solution (but hey, if this is just for training data, do a few misses really matter?): first download, and then deduplicate on your drive.
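The "download first, deduplicate on disk" approach could look roughly like the sketch below. It is not part of gallery-dl; the function names are made up for illustration, and it assumes that duplicates are byte-identical files, so re-encoded or resized copies of the same image would not be caught.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep only groups of 2 or more."""
    groups: dict[str, list[Path]] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups.setdefault(file_md5(path), []).append(path)
    return {md5: paths for md5, paths in groups.items() if len(paths) > 1}


def remove_duplicates(root: Path, dry_run: bool = True) -> list[Path]:
    """Drop all but the first file of each duplicate group.

    With dry_run=True (the default) nothing is deleted; the function only
    returns the paths that would be removed.
    """
    removed = []
    for paths in find_duplicates(root).values():
        for extra in paths[1:]:
            if not dry_run:
                extra.unlink()
            removed.append(extra)
    return removed
```

Running `remove_duplicates(Path("downloads"))` first with the default `dry_run=True` lists what would go before anything is actually deleted; `downloads` is only a placeholder for wherever the files were saved.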
Dedup on my own may be what I go with. I am also trying to merge the resulting tags, so that if an image has only 5 tags on one booru and 50 on another, I get 55 tags instead of just 5 (or just 50, whichever is written last, I guess), unless there is a way to have the metadata field appended to rather than replaced. I tried md5|hash in the archive prefix field and didn't see a difference.
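Merging tags after the fact might be sketched as below. This is only an illustration, not a gallery-dl feature: it assumes each post's metadata has already been loaded into a dict, and the keys `md5` and `tags` are assumed names that may differ per site. Because it takes a set union, tags shared by both boorus are counted once, so 5 tags plus 50 tags gives at most 55.

```python
from collections import defaultdict


def merge_tags(posts):
    """Union the tags of all posts that share the same MD5.

    `posts` is an iterable of dicts with the assumed keys "md5" and "tags";
    "tags" may be a whitespace-separated string or a list of strings.
    """
    merged = defaultdict(set)
    for post in posts:
        md5 = post.get("md5")
        if not md5:
            continue  # nothing to match on, so the post is skipped
        tags = post.get("tags") or []
        if isinstance(tags, str):
            tags = tags.split()
        merged[md5].update(tags)
    # Sort so the output is stable between runs.
    return {md5: sorted(tags) for md5, tags in merged.items()}
```

Posts without an MD5 are skipped here, since there is nothing reliable to match them on.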
Whoops, sorry, I've made a mistake in my comment above (now corrected).
Nearly all *booru sites provide an
Is there a way to manually generate a hash upon user request? Whether that hashes the image or the file, it might allow more consistency. I know it will slow things down if you have good download speed but a slow CPU, but hashing the image content without the metadata might let me find the same image automatically more easily, to prevent duplicates between a source site and a booru, especially if a site provides an MD5 that may be incorrect for whatever reason (database corruption, bad practices, or whatever).
Happens a lot on Danbooru, Gelbooru, and all of the other boorus that I've tested.
I am trying to download images from various boorus to use to train a LoRA. I want to make sure that I don't get duplicates. I am using "filename": "{md5}.{extension}", but I am getting "none" instead of an MD5. I tried checksummd5 as well, but still get "none". How do I prevent duplicates between multiple boorus using a consistent format without getting "none"?