[furaffinity] Downloading some already downloaded files #776

Closed
kattjevfel opened this issue May 22, 2020 · 9 comments

Comments

@kattjevfel
Contributor

Seemingly at random (but always for the same URLs; I can't see any pattern, though), gallery-dl will re-download an already downloaded file. I compared the debug output of a URL that works as expected with one that is re-downloaded when you run the command a second time:

diff --git a/url_OK b/url_BROKEN
index 85860b8..1fd8e79 100644
--- a/url_OK
+++ b/url_BROKEN
@@ -1,8 +1,9 @@
 [gallery-dl][debug] Version 1.14.0-dev
 [gallery-dl][debug] Python 3.8.3 - Linux-5.6.14-zen1-1-zen-x86_64-with-glibc2.2.5
 [gallery-dl][debug] requests 2.23.0 - urllib3 1.25.9
-[gallery-dl][debug] Starting DownloadJob for 'https://www.furaffinity.net/view/36457374'
-[furaffinity][debug] Using FuraffinityPostExtractor for 'https://www.furaffinity.net/view/36457374'
+[gallery-dl][debug] Starting DownloadJob for 'https://www.furaffinity.net/view/32690490'
+[furaffinity][debug] Using FuraffinityPostExtractor for 'https://www.furaffinity.net/view/32690490'
 [urllib3.connectionpool][debug] Starting new HTTPS connection (1): www.furaffinity.net:443
-[urllib3.connectionpool][debug] https://www.furaffinity.net:443 "GET /view/36457374/ HTTP/1.1" 200 None
-/mnt/jupiter/Temp/gallery-dl/furaffinity/nommz/1590100397.nommz_maypatreonpreview.png
\ No newline at end of file
+[urllib3.connectionpool][debug] https://www.furaffinity.net:443 "GET /view/32690490/ HTTP/1.1" 200 None
+[urllib3.connectionpool][debug] Starting new HTTPS connection (1): d.facdn.net:443
+[urllib3.connectionpool][debug] https://d.facdn.net:443 "GET /art/nommz/1565998617/1565998617.nommz_27.png HTTP/1.1" 200 237541

(both files exist in output directory, no new files appear)

@biznizz

biznizz commented May 22, 2020

I'm sure it has to do with the fact that there are times when they've re-encoded filetypes for size or something.

I can't tell you how many times I've seen periods from a sizable gallery, while getting ripped, go from being identified initially as a PNG, to being ripped as a JPG, because of whatever crap the admin team was doing at the time.

It's able to get PNGs, but stuff from, say, 3 years ago will have a period where that happens. So maybe the ripper gets confused by this behavior. I've seen it happen too with stuff I've already got.

@mikf
Owner

mikf commented May 25, 2020

This happens because of downloader.http.adjust-extension and how it's currently implemented.

https://www.furaffinity.net/view/32690490 appears to be a .png file before it is downloaded. gallery-dl checks if a file with the same name/extension exists, doesn't find one, and downloads the file. Afterwards it checks the first few magic bytes, realizes it's actually a .jpg, and renames it, meaning it won't find a .png version the next time it is told to download the same file.
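
A minimal Python sketch of the sequence described above, not gallery-dl's actual downloader code. The function name `download_with_late_adjust` and the `fetch` callable are hypothetical and exist only for illustration; the JPEG magic bytes (FF D8 FF) are the only fixed fact here. It shows why a second run downloads again: the skip check looks for the advertised .png name, while the file on disk was saved as .jpg.

```python
import os
import tempfile

JPEG_MAGIC = b"\xff\xd8\xff"  # first bytes of every JPEG file


def download_with_late_adjust(directory, name, ext, fetch):
    """Hypothetical model of the old behavior.

    Returns True if a download happened, False if the file was skipped.
    """
    path = os.path.join(directory, f"{name}.{ext}")
    if os.path.exists(path):  # skip check uses the *advertised* extension
        return False
    data = fetch()  # the whole file is downloaded here
    if data.startswith(JPEG_MAGIC) and ext != "jpg":
        # extension is corrected only after the download has finished
        path = os.path.join(directory, f"{name}.jpg")
    with open(path, "wb") as fp:
        fp.write(data)
    return True


with tempfile.TemporaryDirectory() as tmp:
    fake_jpeg = lambda: JPEG_MAGIC + b"\xe0 fake image data"
    first = download_with_late_adjust(tmp, "1565998617.nommz_27", "png", fake_jpeg)
    second = download_with_late_adjust(tmp, "1565998617.nommz_27", "png", fake_jpeg)
    files = sorted(os.listdir(tmp))
    print(first, second, files)
```

Both calls report a download, yet only a single .jpg file ends up in the directory, which matches the "no new files appear" observation in the original report.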

@kattjevfel
Contributor Author

TIL downloader.http.adjust-extension is a thing, in the past I've just run something like rename -v ".png" ".jpg" $(file *.png | grep "JPEG" | awk '{print $1}' | tr -d ':' | tr '\n' ' ') on my dirs lol, was wondering why artists suddenly started saving their work properly.

Anyway, is it then checked whether the file exists under the new filename? Maybe it should perform another check once downloader.http.adjust-extension has been applied, and not re-download the already existing file.
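
One way the suggested second check could look, as a hedged Python sketch. `should_skip` is a hypothetical helper, not a gallery-dl function, and it assumes the first few bytes of the response are available before the full download starts. It treats the file as already present if it exists under either the advertised extension or the extension implied by its content.

```python
import os


def should_skip(directory, name, advertised_ext, first_bytes):
    """Hypothetical helper: True if the file already exists under the
    advertised extension or under the one implied by its first bytes."""
    candidates = {advertised_ext}
    if first_bytes.startswith(b"\xff\xd8\xff"):  # JPEG
        candidates.add("jpg")
    elif first_bytes.startswith(b"\x89PNG\r\n\x1a\n"):  # PNG
        candidates.add("png")
    return any(
        os.path.exists(os.path.join(directory, f"{name}.{ext}"))
        for ext in candidates
    )
```

With this check, a file advertised as .png but stored as .jpg would be skipped on the second run, at the cost of reading the start of the response first.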

@shinji257
Contributor

I think the issue here is that adjust-extension happens after it has been downloaded.

This actually explains the issue I'm having with e-hentai/exhentai, because all retrieved items have no extension and are then given an extension based on the content.

mikf added a commit that referenced this issue Nov 29, 2020
Check file headers against a list of file signatures before
downloading the whole file and writing it to disk.

The file signature check needs some improvements (*),
but it produces usable results for the most part.

(*)
- 'webp', 'wav', and others start with 'RIFF'
- 'svg' uses the same "signature" as all XML documents
- 'webm' has the same signature as 'mkv' files
- only 'mp3' files in an ID3v2 container get recognized
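
A hedged Python sketch of a header check along the lines the commit message describes. The `SIGNATURES` table and `guess_extension` are illustrative assumptions, not the table or function from the commit. It also shows one possible way to handle the RIFF ambiguity listed above: in a RIFF container, bytes 8 to 12 name the actual format.

```python
# Illustrative signature table (an assumption, not gallery-dl's own list).
SIGNATURES = [
    (b"\xff\xd8\xff", "jpg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"ID3", "mp3"),  # only MP3 files that start with an ID3v2 tag
]

# RIFF form types found at offset 8 of the header.
RIFF_FORMS = {b"WEBP": "webp", b"WAVE": "wav", b"AVI ": "avi"}


def guess_extension(header):
    """Return an extension guessed from the first bytes, or None."""
    for magic, ext in SIGNATURES:
        if header.startswith(magic):
            return ext
    if header[:4] == b"RIFF" and len(header) >= 12:
        return RIFF_FORMS.get(header[8:12])
    return None


print(guess_extension(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(guess_extension(b"RIFF\x00\x00\x00\x00WEBPVP8 "))
```

The svg/XML and webm/mkv cases cannot be separated by a fixed prefix like this and would need a look further into the file.
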
@mikf
Owner

mikf commented Nov 29, 2020

Should be fixed in 536c088, but these are some rather significant changes to the HTTP downloader code and there are most likely some bugs in there. I'll leave this open for now so you can report any crashes etc. in here.

The file signature check also needs some work, but it is good enough for now.

@kattjevfel
Contributor Author

Sadly it completely breaks deviantart #1144

@J20X9

J20X9 commented Dec 3, 2020

There seems to be an issue with JPEG extensions: gallery-dl doesn't recognize previously downloaded images that have a .jpeg file extension. I just upgraded to the latest dev version, and images that were previously downloaded with .jpeg extensions are now downloaded with .jpg extensions. This would result in a lot of duplicates, since every old .jpeg image would now have an identical .jpg copy.

An example of this can be seen with gallery-dl https://baraag.net/@satori/3 with a filename template of {id}.{extension}. It previously downloaded the image as 3.jpeg, while repeating the command would result in gallery-dl recognizing that the image was already downloaded and skipping it. Now the new gallery-dl version ignores the 3.jpeg file downloaded by the previous version and it instead downloads a new copy of the image as 3.jpg.

This could be resolved by running something like rename -n 's/\.jpeg$/.jpg/' ./** (drop -n, which only previews the changes, to actually rename) before running the new gallery-dl version, replacing all .jpeg's with .jpg's so that gallery-dl will recognize them. However, I would like to know whether the above behavior is intended to remain or not. It would be a bad idea to run that rename command if gallery-dl is patched so that it retains .jpeg extensions.
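
For anyone without the Perl rename tool, a rough Python equivalent is sketched below. `rename_jpeg_to_jpg` is a hypothetical helper written for this example; it defaults to a dry run and leaves a .jpeg file alone if a .jpg with the same name already exists, so nothing gets overwritten.

```python
from pathlib import Path


def rename_jpeg_to_jpg(root, dry_run=True):
    """Rename *.jpeg to *.jpg below root.

    Returns the list of (old, new) pairs; nothing is touched in a dry run.
    """
    planned = []
    for old in sorted(Path(root).rglob("*.jpeg")):
        new = old.with_suffix(".jpg")
        if new.exists():  # never overwrite an existing .jpg
            continue
        planned.append((old, new))
        if not dry_run:
            old.rename(new)
    return planned
```

Calling `rename_jpeg_to_jpg(".")` first and inspecting the returned pairs, then repeating with `dry_run=False`, mirrors the preview-then-apply workflow of `rename -n`.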

@mikf
Owner

mikf commented Dec 3, 2020

@J20X9 this is intended behavior and happens because of the changes in commit 9b1bd09. You can "undo" this behavior by setting extension-map to an empty object in your config file, i.e. "extension-map": {}.
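
To illustrate the effect of such a mapping, here is a small Python sketch. `apply_extension_map` is a hypothetical function and the map `{"jpeg": "jpg"}` is an assumed example, not necessarily gallery-dl's default value or its implementation. It shows why an empty object restores the old names: with nothing to look up, every extension passes through unchanged.

```python
def apply_extension_map(filename, extension_map):
    """Hypothetical helper: replace the extension if the map has an entry."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension at all
        return filename
    return f"{stem}.{extension_map.get(ext, ext)}"


print(apply_extension_map("3.jpeg", {"jpeg": "jpg"}))  # mapped
print(apply_extension_map("3.jpeg", {}))               # unchanged
```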

@J20X9

J20X9 commented Dec 3, 2020

@mikf Ah, so that's what extension-map is for. I must not have used gallery-dl since that was added, or I just didn't notice until I tried out the dev version. I'll just use that rename command to clean up my old .jpeg downloads.

@mikf mikf closed this as completed Dec 18, 2020