
Questions, Feedback and Suggestions #2 #74

Closed
mikf opened this issue Jan 23, 2018 · 97 comments

@mikf
Owner

mikf commented Jan 23, 2018

Continuation of the old issue, serving as a central place for any sort of question or suggestion that doesn't deserve its own separate issue.

#11 had gotten too big, took several seconds to load, and was closed as a result.
There is also https://gitter.im/gallery-dl/main if that seems more appropriate.

@Hrxn
Contributor

Hrxn commented Jan 25, 2018

A bit of feedback:
I've been downloading some stuff from Tumblr now, and everything seems to be working like a charm so far. Kudos to you.
Including --write-unsupported FILE, detecting and saving embedded links from Vine, Instagram, and links to other external sites. It's probably safe to assume it'll work just the same with embeds from YouTube and Vimeo, which I thought to be pretty common on Tumblr, but it's entirely possible that the blogs processed so far didn't have a single one. I'll try that later on my own test blog, just to be sure.

This also led me to another idea/suggestion:
How about a --write-log FILE feature?
I'm aware that there is something like tee, and even something similar for PowerShell, but I think the purpose of the logging feature (at least at its default setting) would not be to replicate the full output printed to the console, but only to record any relevant errors or warnings, i.e. 404s, 403s, and also timeouts, which can easily be caused by connectivity issues rather than by gallery-dl itself.

Not sure, just an idea...

@mikf
Owner Author

mikf commented Jan 26, 2018

Sounded like a useful feature, so I tried to put something together: 97f4f15. It should more or less behave like one would expect, but there are at least two things that might be better handled otherwise:

Edit: Never mind the points below, I thought about it and decided to change its behavior. It is no longer persistent and stores exactly the same log messages as shown on screen (c9a9664).

  • Log files are currently persistent across gallery-dl invocations and can grow indefinitely. Other options would be deleting their old content each time or even log rotation; Python has built-in support for that as well (see the sketch below).
  • Debug log messages will never be written to a log file, even when using --verbose. It would be easier to copy the --verbose output from a text file instead of a console window, but writing all of the debug output to an otherwise concise log file didn't seem like a good idea.
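
For reference, a minimal sketch of that built-in rotation support in Python's logging module (the file name and size limits here are made-up values):

import logging
from logging.handlers import RotatingFileHandler

# rotate at ~1 MiB and keep three old files (.log.1, .log.2, .log.3)
handler = RotatingFileHandler(
    "gallery-dl.log", maxBytes=1024 * 1024, backupCount=3)
handler.setFormatter(
    logging.Formatter("[%(name)s][%(levelname)s] %(message)s"))

log = logging.getLogger("tumblr")
log.addHandler(handler)
log.warning("example")  # appends "[tumblr][WARNING] example" to the file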

@Hrxn
Contributor

Hrxn commented Jan 28, 2018

One small addition to logging behaviour:

Not entirely sure without a test case at hand right now, but what's the current output for Tumblr images that rely on the fallback mechanism? I don't know if I remember this correctly, but it appeared that trying to download some specific image (e.g. some old images, old URL scheme, etc.) from s3.amazonaws resulted in an error (403) and printed the corresponding messages to the terminal, while the next successful download had the same post ID in its name, so that would be the fallback URL being used here.
I hope you know what I mean; if not, I'll try some random blogs again, wait for that error to appear, and copy the messages here.
Because the question is: should the error be printed to the output/logged to file in this case? Even if the "raw" URL does not work and we run into an actual 403, the fallback URL/mechanism still works and downloads that image successfully, so it's a bit debatable whether that would really qualify as a real "error" 😄

Edit:

Had the opportunity to think about this again, and I'm not sure it's actually worth the bother. Sure, it may be less than ideal (think of your average end-user©® and the reaction: "OMG, it says there's some error"), but this can be solved by simply, uh, explaining the stuff. And whether this really warrants handling different error types, error classes, error codes and whatever, just for the sake of, what, consistency(?), I'm not really sure anymore.

@mikf
Owner Author

mikf commented Jan 28, 2018

I think it is actually worth the bother. With the way things were, it was impossible to tell whether all files had been downloaded successfully just by looking at a log file, and error messages from an image with fallback URLs were kind of misleading as well, since the image download in question did succeed in the end.

I added two more logging messages to hopefully remedy this (db7f04d):

  1. Failed to download <filename> when an image could not be downloaded, even when using fallback URLs.
  2. Trying fallback URL #<number> to indicate that the last error message is not fatal.

Maybe it would be better to categorize all HTTP errors as warnings and only show the Failed to download … message as a definite error?

@Hrxn
Contributor

Hrxn commented Jan 28, 2018

Maybe it would be better to categorize all HTTP errors as warnings and only show the Failed to download … message as a definite error?

Yeah, sounds good.
👍

@rachmadaniHaryono
Contributor

@mikf,

  • Which extractor class from the common module should I use to make a CustomExtractor? Are there any requirements for each extractor class?
  • Can you explain the extractor classes' attributes? (e.g. BooruExtractor has 'basecategory', 'filename_fmt', 'api_url', 'per_page', 'page_start', 'page_limit', etc.)
  • Is there any rule for how the metadata attribute of an extractor class should be built, or is it up to each extractor?

@mikf
Owner Author

mikf commented Jan 30, 2018

  1. Generally you should use the basic Extractor class, but, as always, it depends. There are some general extractor sub-classes (BooruExtractor, FoolslideExtractor, FoolfuukaExtractor, ...) and it might also be helpful to just copy an existing extractor module and adjust it to your needs.
    As for requirements: set the category, subcategory, filename_fmt, directory_fmt and pattern class attributes to some reasonable values (see, for example, slideshare.py, or the sketch after this list).

  2. category and subcategory are essentially an extractor's name and are used for config-lookup.
    directory_fmt and filename_fmt are default values for the directory and filename options.
    pattern is a list of regex-strings. An extractor is used if one of them matches the given URL. The resulting match-object is the second parameter to an extractor's __init__() method.
    basecategory has to do with shared config values; just ignore it.

    The other attributes you listed are BooruExtractor-specific:

    • api_url: URL to send API requests to
    • per_page: number of post-entries per page
    • page_start: the first page (0 or 1 depending on site)
    • page_limit: largest valid page number
  3. You kind of asked the same thing before.
    It is up to each extractor, but similar ones should use the same key-names. For image-metadata, you should always provide the filename extension as extension or at least set it to None.
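
To tie the above together, a minimal hypothetical skeleton (every name in it, e.g. example.org, ExampleImageExtractor, post_id, is invented; slideshare.py and the other modules in gallery_dl/extractor/ are the real reference points):

from .common import Extractor, Message

class ExampleImageExtractor(Extractor):
    """Extractor for single image posts on example.org (hypothetical)"""
    category = "example"
    subcategory = "image"
    directory_fmt = ["{category}", "{user}"]
    filename_fmt = "{category}_{id}.{extension}"
    pattern = [r"(?:https?://)?(?:www\.)?example\.org/post/(\d+)"]

    def __init__(self, match):
        Extractor.__init__(self)
        self.post_id = match.group(1)  # match is the second __init__ parameter

    def items(self):
        url = "https://example.org/images/{}.jpg".format(self.post_id)
        data = {"id": self.post_id, "user": "someuser", "extension": "jpg"}
        yield Message.Version, 1
        yield Message.Directory, data  # general metadata -> directory
        yield Message.Url, url, data   # per-file metadata -> filename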

@rachmadaniHaryono
Contributor

rachmadaniHaryono commented Jan 31, 2018

Is there any rule for how the metadata attribute of an extractor class should be built, or is it up to each extractor?

You kind of asked the same thing before.
It is up to each extractor, but similar ones should use the same key-names. For image-metadata, you should always provide the filename extension as extension or at least set it to None.

Actually, I got the wrong impression from chan.py. I thought that to use the match object later in the items method, it had to be stored in the class attribute metadata, but after looking at slideshare.py, any name can be used (e.g. user and presentation).


here is what i can come up with https://gist.github.com/rachmadaniHaryono/e7d40fcc5b9cd6ecc1f9151c4f0f5d84

full code https://github.com/rachmadaniHaryono/RedditImageGrab/blob/master/redditdownload/api.py

This module will not download a file; it will only extract from the URL.

@rachmadaniHaryono
Contributor

@mikf can you give an example for 6a07e38?

@mikf
Owner Author

mikf commented Feb 2, 2018

# an extractor class of your own, or a whole module of them ...
from my_project import module_with_extractors

class SomeExtractor(Extractor):
    ...

# ... can be registered like this:
from gallery_dl import extractor
extractor.add(SomeExtractor)
extractor.add_module(module_with_extractors)

You should use these functions instead of manually manipulating extractor._cache and relying on implementation details.

@ChiChi32

Am I doing something wrong? I also tried the option in the config file; it doesn't work.

[screenshot: 2018-02-25_000317]

@Hrxn
Contributor

Hrxn commented Feb 25, 2018

Which version of gallery-dl is that? Can you run gallery-dl -v please?

@Bfgeshka

Can we have percent-encoding conversions for saved files? I.e. replacing %20 in a filename with a space, %22 with ", etc.

@ChiChi32

@Hrxn, I'm a bit embarrassed ... I found a strange thing. I have two folders, gallery_dl and gallery_dln. The first is the old version 1.1.2, the second is 1.2.1. Both are in the same directory. When I run any command using the bat file from the folder with the new version, the modules are taken from the old one. When I run --version from the 1.2.1 folder, 1.1.2 is displayed. I do not think this is a problem with the program, rather with Windows or Python. I apologize for the disturbance.

@mikf
Owner Author

mikf commented Feb 25, 2018

@ChiChi32 the __main__.py file expects to sit inside a directory named gallery_dl.
In your specific case it adds F:\Python to its PYTHONPATH environment and then imports the gallery_dl package, which is the older 1.1.2 version.
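Roughly what happens there (a sketch of the idea, not the literal file):

import os.path
import sys

# __main__.py sits in ...\Python\gallery_dl\, so the parent directory
# (...\Python\) is prepended to the module search path ...
path = os.path.realpath(os.path.abspath(__file__))
sys.path.insert(0, os.path.dirname(os.path.dirname(path)))

# ... and this import then picks up whichever gallery_dl package is
# found there first, the old 1.1.2 one in this case
import gallery_dl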
If you want to use multiple versions at the same time, you could try a directory-structure like

Python
|- gallery-dl-1.1.2
|  \- gallery_dl
|     |- __main__.py
|     |- ...
\- gallery-dl-1.2.1
   \- gallery_dl
      |- __main__.py
      |- ...

@Bfgeshka Sure, I think I'll add another conversion option for format strings to let users unquote the "offending" parts of a filename.
These percent-encoding conversions (and similar) for each metadata-field are usually already handled as necessary. Where did you find something that hasn't been properly converted?
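The unquoting itself would presumably boil down to urllib.parse.unquote; for illustration:

>>> from urllib.parse import unquote
>>> unquote("some%20file%20%22name%22.jpg")
'some file "name".jpg'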

@Bfgeshka

@mikf I encountered it when downloading direct links.

@Hrxn
Contributor

Hrxn commented Mar 22, 2018

Some small thing I've noticed. Not a real issue deserving of a ticket, I presume.
But I'm still curious what it means, or what the cause behind it is.

PS E:\> gallery-dl.exe -v 'https://gfycat.com/distortedmemorableibizanhound'
[gallery-dl][debug] Version 1.3.1
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.16299
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting DownloadJob for 'https://gfycat.com/distortedmemorableibizanhound'
[gfycat][debug] Using GfycatImageExtractor for 'https://gfycat.com/distortedmemorableibizanhound'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): gfycat.com
[urllib3.connectionpool][debug] https://gfycat.com:443 "GET /cajax/get/distortedmemorableibizanhound HTTP/1.1" 200 None
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): giant.gfycat.com
[urllib3.connectionpool][debug] https://giant.gfycat.com:443 "GET /DistortedMemorableIbizanhound.webm HTTP/1.1" 200 18240849
* E:\Transfer\INPUT\GLDL\Anims\\fluid in an invisible box750-720p DistortedMemorableIbizanhound.webm
PS E:\>

The download seems to work, as is apparent above. But the output is a bit different, not what I'm used to seeing; just observe this output path: E:\Transfer\INPUT\GLDL\Anims\\fluid in an invisible box750- [...]

The extra backslash, as if some directory is missing in between.

I post the configuration used here:

  • The general part (keywords, and keywords-default)
{
    "base-directory": "E:\\Transfer\\INPUT\\GLDL",
    "netrc": false,

    "downloader":
    {
        "part": true,
        "part-directory": null,
        "http":
        {
            "rate": null,
            "retries": 5,
            "timeout": 30,
            "verify": true
        }
    },
    "extractor":
    {
        "keywords": {"bkey": "", "ckey": "", "tkey": "", "skey": "", "mkey": ""},
        "keywords-default": "",
        "archive": "E:\\Transfer\\INPUT\\GLDL\\_Archives\\gldl-archive-global.db",
        "skip": true,
        "sleep": 0,
[...]
  • Gfycat
        "gfycat":
        {
            "directory": ["Anims", "{bkey}", "{ckey}", "{tkey}", "{skey}", "{mkey}"],
            "filename": "{title:?/ /}{gfyName}.{extension}",
            "format": "webm"
        },

But it does not happen here, for example:

  • Imgur
        "imgur":
        {
            "image":
            {
                "directory": ["{bkey}", "{ckey}", "{tkey}", "{skey}", "{mkey}", "Images"],
                "filename": "{title:?/ /}{hash}.{extension}"
            },
            "album":
            {
                "directory": ["{bkey}", "{ckey}", "{tkey}", "{skey}", "{mkey}", "Albums", "{album[title]:?/ /}{album[hash]}"],
                "filename": "{album[hash]}_{num:>03}_{hash}.{extension}"
            },
            "archive": "E:\\Transfer\\INPUT\\GLDL\\_Archives\\gldl-archive-imgur.db",
            "mp4": true
        },

Single Image:

PS E:\> gallery-dl.exe -v 'https://imgur.com/5m4CFZS'
[gallery-dl][debug] Version 1.3.1
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.16299
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting DownloadJob for 'https://imgur.com/5m4CFZS'
[imgur][debug] Using ImgurImageExtractor for 'https://imgur.com/5m4CFZS'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): imgur.com
[urllib3.connectionpool][debug] https://imgur.com:443 "GET /5m4CFZS HTTP/1.1" 200 49800
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): i.imgur.com
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /5m4CFZS.png HTTP/1.1" 200 1940189
* E:\Transfer\INPUT\GLDL\Images\Caley Rae Pavillard 5m4CFZS.png
PS E:\>

Album:

PS E:\> gallery-dl.exe -v 'https://imgur.com/a/jQxtc'
[gallery-dl][debug] Version 1.3.1
[gallery-dl][debug] Python 3.4.4 - Windows-10-10.0.16299
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting DownloadJob for 'https://imgur.com/a/jQxtc'
[imgur][debug] Using ImgurAlbumExtractor for 'https://imgur.com/a/jQxtc'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): imgur.com
[urllib3.connectionpool][debug] https://imgur.com:443 "GET /a/jQxtc/all HTTP/1.1" 200 62847
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): i.imgur.com
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /t9CD48N.jpg HTTP/1.1" 200 126079
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_001_t9CD48N.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /VoGBS4N.jpg HTTP/1.1" 200 148669
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_002_VoGBS4N.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /svbJXyy.jpg HTTP/1.1" 200 146013
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_003_svbJXyy.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /kDjvkrD.jpg HTTP/1.1" 200 130492
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_004_kDjvkrD.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /GxPVJSw.jpg HTTP/1.1" 200 154477
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_005_GxPVJSw.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /tUIUbSL.jpg HTTP/1.1" 200 194268
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_006_tUIUbSL.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /vcvv1r0.jpg HTTP/1.1" 200 193132
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_007_vcvv1r0.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /YBQddcB.jpg HTTP/1.1" 200 147301
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_008_YBQddcB.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /FkuxOXZ.jpg HTTP/1.1" 200 169420
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_009_FkuxOXZ.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /MB30wRC.jpg HTTP/1.1" 200 223108
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_010_MB30wRC.jpg
[urllib3.connectionpool][debug] https://i.imgur.com:443 "GET /TGnsGoh.jpg HTTP/1.1" 200 147744
* E:\Transfer\INPUT\GLDL\Albums\jQxtc\jQxtc_011_TGnsGoh.jpg
PS E:\>

Will add some more tests eventually, to see if I can get any different results with various input file options.

But so far, it seems to have something to do with
"directory": ["{bkey}", ...] (output path beginning with my custom keyword)
vs.
"directory": ["Anims", "{bkey}", ...] (output path starting with a fixed directory).

@mikf
Owner Author

mikf commented Mar 22, 2018

This happens because of os.path.join()'s behavior when using an empty string as the last argument:

>>> from os.path import join
>>> join("", "d1", "", "d2")
'd1/d2'
>>> join("", "d1", "", "d2", "")
'd1/d2/'

It adds a slash (or back-slash on Windows) to the end if the last argument is an empty string.

I've been using path = directory + separator + filename to build the final complete path, with the assumption that directories don't have a path-separator at the end, which, in your case, resulted in two of them ("...\Anims\" + "\" + "fluid in an invisible box...").
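
A sketch of one way around this, dropping empty segments before joining so no stray separator is produced (just the idea, not necessarily the actual fix from the commit referenced below):

import os

def build_path(segments, filename):
    """Join directory segments and a filename, skipping empty segments."""
    parts = [s for s in segments if s]
    parts.append(filename)
    return os.path.join(*parts)

build_path(["E:\\GLDL", "Anims", "", "", ""], "clip.webm")
# -> 'E:\\GLDL\\Anims\\clip.webm' on Windows; no doubled separator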

mikf added a commit that referenced this issue Mar 22, 2018
@Hrxn
Contributor

Hrxn commented Mar 22, 2018

Ah, thanks. Makes sense. This behaviour of os.path.join() is again something that makes me wonder whether it is intentional or just some quirk.
At least this time it's a quirk that affects all platforms in the same way, right? 😄

Edit:
Ha, my mistake. It's probably intentional, I see where this could be useful.

@Hrxn
Contributor

Hrxn commented Mar 23, 2018

BTW, everything works with the latest commit. The example URL above is handled correctly now, and I did not encounter the issue anywhere else!

@reversebreak

Just something quick I noticed: gallery-dl appears to be unable to handle certain emoji appearing in captions on Tumblr (and maybe elsewhere?).
(Warning: the post I was able to trigger this with is rather NSFW.)
Running --list-keywords on an offending post with --verbose and piping the error output with 2> to a file gets me

[gallery-dl][debug] Version 1.3.2
[gallery-dl][debug] Python 3.4.4 - Windows-7-6.1.7601-SP1
[gallery-dl][debug] requests 2.18.4 - urllib3 1.22
[gallery-dl][debug] Starting KeywordJob for 'http://aurahack18.tumblr.com/post/172338300565'
[tumblr][debug] Using TumblrPostExtractor for 'http://aurahack18.tumblr.com/post/172338300565'
[urllib3.connectionpool][debug] Starting new HTTPS connection (1): api.tumblr.com
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/aurahack18.tumblr.com/info?api_key=O3hU2tMi5e4Qs5t3vezEi6L0qRORJ5y9oUpSGsrWu8iA3UCc3B HTTP/1.1" 200 371
[urllib3.connectionpool][debug] https://api.tumblr.com:443 "GET /v2/blog/aurahack18.tumblr.com/posts?reblog_info=true&id=172338300565&api_key=O3hU2tMi5e4Qs5t3vezEi6L0qRORJ5y9oUpSGsrWu8iA3UCc3B&offset=0&limit=50 HTTP/1.1" 200 1374
[tumblr][error] An unexpected error occurred: UnicodeEncodeError - 'cp932' codec can't encode character '\u2661' in position 3: illegal multibyte sequence. Please run gallery-dl again with the --verbose flag, copy its output and report this issue on https://github.com/mikf/gallery-dl/issues .
[tumblr][debug] Traceback
Traceback (most recent call last):
File "E:\gallery-dl\gallery_dl\job.py", line 64, in run
File "E:\gallery-dl\gallery_dl\job.py", line 117, in dispatch
File "E:\gallery-dl\gallery_dl\job.py", line 131, in handle_urllist
File "E:\gallery-dl\gallery_dl\job.py", line 236, in handle_url
File "E:\gallery-dl\gallery_dl\job.py", line 279, in print_keywords
UnicodeEncodeError: 'cp932' codec can't encode character '\u2661' in position 3: illegal multibyte sequence

This is using the executable download, btw.

@Hrxn
Contributor

Hrxn commented Apr 4, 2018

You are running this via CMD.exe I presume?

What happens if you do this first in CMD: chcp 65001
And then run gallery-dl?

Edit:

Or try to use PowerShell. I've completely moved to PowerShell by now as well.

@mikf
Owner Author

mikf commented Apr 4, 2018

This is a more general problem with the interaction between Windows, the Python interpreter, Unicode, code pages and so on.

As @Hrxn mentioned, you should be able to work around this yourself by changing the default code page to UTF-8 via chcp 65001. Another way is to set the PYTHONIOENCODING environment variable to utf-8 before running gallery-dl:

E:\>set PYTHONIOENCODING=utf-8
E:\>py -3.4 -m gallery_dl -K http://aurahack18.tumblr.com/post/172338300565
...

Python 3.6 and above also don't have this problem (they implement PEP 528), so using one of those instead of the standalone exe might be another option.

I tried to implement a simple workaround in 0381ae5 by setting the default error handler for stdout and co. to replace all non-encodable characters with a question mark. Tested this on my Windows 7 VM with Python3.3 to 3.6 and it seems to work.
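
The idea behind that workaround, roughly (a sketch, not necessarily the exact code from 0381ae5):

import io
import sys

# re-wrap stdout so non-encodable characters become "?"
# instead of raising UnicodeEncodeError
sys.stdout = io.TextIOWrapper(
    sys.stdout.buffer,
    encoding=sys.stdout.encoding,
    errors="replace",
    line_buffering=True,
)

print("\u2661")  # on a cp932 console this now prints "?" instead of crashing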

@reversebreak

reversebreak commented Apr 7, 2018

Thanks. I got Python 3.6, installed it with pip as per the instructions, and it doesn't crash anymore. I put Python on my PATH and it's the same user experience as using the EXE anyway.

Now, I might be missing something, but is there any way to extract "gallery order" for use in the filenames from Tumblr?
Right now, I can point gallery-dl at a tumblr post, and it gets all the pictures fine - but sometimes that means for short two-to-ten page comic posts the files are downloaded out of order.
After doing some testing and poking around it seems that the order that the files display in a photo post isn't necessarily the same as the order of their filenames (or the 'name' parameter).

You can test this yourself by creating a photo post, uploading say four photos one at a time, then saving the post.
Go back and edit the post, and drag photo 2 before photo 1, and save it again.
If you use CTRL+RIGHTCLICK+"View Image" on each in turn your tabs should go to filenames based off of _o2, _o1, _o3, _o4.
If you point gallery-dl at it it'll download it just fine, but the end result will sort as _o1, _o2, _o3, _o4.
This will show the gallery in the wrong order, which is terrible for comic posts.
Unusually, gallery-dl seems to download them in _o2, _o1, _o3, _o4 order, (according to the on-screen status), but I can't see it exposing that order to the user in parameters anywhere.

In addition to this, I don't see a way to extract the 'core name' of a file for use in the extractor.*.filename parameter.
Tumblr filenames when downloaded without a filename parameter come out something like
tumblr_fakeusername_12345678912o1.jpg
However, using the filename parameter to add extra stuff to the filename means you can't get that clean end anymore.
The closest parameter is 'name', which comes out something like
tumblr_12345678912o1_r1_1280
or similar, when all you really want is the 12345678912o1 that only gallery-dl's default naming scheme seems to get access to.

@Hrxn
Contributor

Hrxn commented Apr 7, 2018

Right now, I can point gallery-dl at a tumblr post, and it gets all the pictures fine - but sometimes that means for short two-to-ten page comic posts the files are downloaded out of order.
After doing some testing and poking around it seems that the order that the files display in a photo post isn't necessarily the same as the order of their filenames (or the 'name' parameter).

Yes, I know what you mean. This has never been a problem for me so far, because I've only downloaded picture sets that are just, well, sets of pictures, so the order was actually not relevant. But I agree, it's entirely different for something like a comic strip.

The filenames in a set post do not reflect the displayed order of the elements, as you already said. You also stated the reason for this: if you make a picture post and upload some files, they get generated names in that order. But the post can be rearranged afterwards, changing the order of the displayed items. What happens is that the structure you see in the end (in HTML) has the rearranged order as done by the creator of the post, but the filenames stay the same as they were at upload.

If you point gallery-dl at it it'll download it just fine, but the end result will sort as _o1, _o2, _o3, _o4.
This will show the gallery in the wrong order, which is terrible for comic posts.

I assume what you mean with end result here is the order of the actual downloaded files. Yes, that is how they are sorted by the filesystem, in "natural" order.

Unusually, gallery-dl seems to download them in _o2, _o1, _o3, _o4 order, (according to the on-screen status), but I can't see it exposing that order to the user in parameters anywhere.

Now this bit is really interesting, because gallery-dl just takes what it gets from the API, and this seems to indicate that the API returns the individual elements in the correct order, i.e. as rendered in a browser.
This would be good, because I think this is something that could be fixed without jumping through any hoops, since gallery-dl could avoid fetching the HTML for a post entry and extracting the order from there.

In addition to this, I don't see a way to extract the 'core name' of a file for use in the extractor.*.filename parameter.

Not exactly sure what you mean here.
The filename (standard) is this:

filename_fmt = "{category}_{blog_name}_{id}o{offset}.{extension}"

Compared to your example at the end, the number in the filename is {id}, and that oX part is {offset}.
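
Expanded with the example name from above, that default format gives:

>>> fmt = "{category}_{blog_name}_{id}o{offset}.{extension}"
>>> fmt.format(category="tumblr", blog_name="fakeusername",
...            id="12345678912", offset="1", extension="jpg")
'tumblr_fakeusername_12345678912o1.jpg'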

Small side note: This is a bit confusing within the source code, because "offset" is also used as the name for the parameter to retrieve the posts from the API.

@reversebreak

Ah, yes I did mean the final order as sorted by the filesystem - since there's no way right now to get 'gallery order' from gallery-dl, then the only ordering the filesystem has to go off of is the filename with the oX at the end (the "offset", as you said).

Just as a note, I haven't done thorough testing on all cases of the reordered gallery, so I haven't proven the ordering comes out like that in all cases.
Assuming it does, then I expect it'd be pretty easy to put a counter on the loop that downloads photos in posts.

Ah, sorry, I was using several different posts for testing and got confused about the outputs. I didn't notice the ID parameter was being used for the default - I thought I was getting a short form of the name parameter at the end of the default filename.

@AnyByte

AnyByte commented Sep 17, 2018

Try converting images with an odd width or height and you will get this error without any flags. But yeah, maybe you're right that this should remain optional, as the user might want to change the output video scale, which could conflict and would need additional handling.

For example, with the command ffmpeg -r 1/5 -i %06d.jpg out.mp4 and image dimensions of 1000x843, I'm getting:

[libx264 @ 000002053424e380] height not divisible by 2 (1000x843)
Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height
Conversion failed!

And when using gallery-dl, it just generates an empty mp4 file.
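
For reference, the usual FFmpeg-side workaround for odd dimensions (generic FFmpeg usage, not a gallery-dl option) is to force even values with a scale filter:

ffmpeg -r 1/5 -i %06d.jpg -vf "scale=trunc(iw/2)*2:trunc(ih/2)*2" out.mp4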

@mikf
Owner Author

mikf commented Sep 17, 2018

I did, and it works fine. Maybe it depends on the FFmpeg version?

$ gallery-dl --ignore-config --ugoira-conv https://danbooru.donmai.us/posts/3251265 
ffmpeg version n4.0.2 Copyright (c) 2000-2018 the FFmpeg developers
  built with gcc 8.2.0 (GCC)
...
Output #0, webm, to './gallery-dl/danbooru/danbooru_3251265_61363f26c6265130594cbeb9a5d53c63.webm':
  Metadata:
    encoder         : Lavf58.12.100
    Stream #0:0: Video: vp9 (libvpx-vp9), yuv420p, 675x675 [SAR 1:1 DAR 1:1], q=-1--1, 200 kb/s, 8 fps, 1k tbn, 8 tbc
    Metadata:
      encoder         : Lavc58.18.100 libvpx-vp9
    Side data:
      cpb: bitrate max/min/avg: 0/0/0 buffer size: 0 vbv_delay: -1
...

@AnyByte

AnyByte commented Sep 17, 2018

Yeah, it works fine with --ugoira-conv, but with this config:

"danbooru":
{
    "postprocessors":
    [{
        "name": "ugoira",
        "extension": "mp4",
        "keep-files": true,
        "ffmpeg-output": false
    }]
},

it doesn't.

@wankio
Contributor

wankio commented Sep 19, 2018

Can the pixiv extractor support illust_id?
www.pixiv.net/member_illust.php?mode=medium&illust_id=xxxxxxxx Thanks.

@mikf
Owner Author

mikf commented Sep 19, 2018

That has already been supported since 2a97296.
You need to put the URL between double quotation marks "..." because it contains an ampersand &. That's probably why it doesn't work for you.
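
For example (with a placeholder ID):

$ gallery-dl "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=12345678"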

@wankio
Contributor

wankio commented Sep 19, 2018

Oh thanks, I didn't even know it was already supported.

But if people don't ask, how would they know to use "..." when trying to download a single Pixiv post? Are we missing a guide section? Like filters to ignore things, or using --range to download specific pics, ...

@Hrxn
Contributor

Hrxn commented Sep 20, 2018

Not sure what you mean?

Using quotation marks for URLs is standard practice for any shell, basically. One should always use them.

And what about a single post? gallery-dl always works in the same fashion: if you browse Pixiv, or any other site for that matter, you just click on the post you want, and the URL to use for gallery-dl is what is displayed in the browser's address bar. (You can also right-click on the link element while browsing and copy the link from there, of course.)

@ChiChi32

Hello again... Maybe I'll ask a stupid question now, but does the blacklist in "extractor.recursive.blacklist" work only for the recursive extractor, or can it be used for deviantart, for example? And if it does not work, is a keyword blacklist for extractors planned?

@Hrxn
Contributor

Hrxn commented Sep 24, 2018

This is not about blacklisting keywords; it only prevents recursive usage of the extractors specified there.

@mikf
Owner Author

mikf commented Sep 24, 2018

@ChiChi32
I believe what you are looking for is the --filter command-line option.

@wankio
--filter and --range are explained in the output of gallery-dl --help, although a few more examples in the README wouldn't hurt, I guess.

@ChiChi32

@mikf I know about this option (although I admit I forgot about it), but ... 1. The command line is inconvenient if you need to filter out more than a few words. 2. All sites will be filtered, which is not very suitable.
And this is not criticism, but in some cases the documentation is very fragmented. Many probably would not even guess that --filter and its options exist without using --help.

@wankio
Contributor

wankio commented Oct 5, 2018

On deviantart, how can I disable downloading the ZIP file when I have "original": true enabled? Thanks :) I only saw a filter for images.

@Hrxn
Contributor

Hrxn commented Oct 5, 2018

Not really possible, I think. If the actual linked "original" is indeed a ZIP archive file, this is what you're supposed to get.

I'm not surprised to see this; I've seen all kinds of different file formats uploaded on DA.

But if you could provide an example link to such a Deviation entry, it sure would help 😄

@wankio
Contributor

wankio commented Oct 5, 2018

https://www.deviantart.com/oofiloo/art/GF-PROMETHEUS-NO-ARMOR-ORIGINAL-BONES-472067552
If it finds a ZIP or another extension, should it download both (the ZIP etc. plus the image) or only the image?

I need to run it twice, with and without original, to download the full sample/image. It's not a problem, but I think there should be a filter: if the original is a ZIP or something similar, download the preview image instead.

@taiyu-len

Pretty nice program, though I have come across a few minor issues.

In the JSON output from --dump-json, big integers (Twitter's tweet_id and such) are written as numbers, which is a bit of a problem because JSON implementations are not required to handle such large numbers precisely; the program jq, for example, would output different values. Having it output big ints as strings would be preferable.

Another issue I've had is that extractor.pixiv.directory (and perhaps that of other sites) does not have the same keywords available for use in the format string as filename does, which makes organizing a bit more difficult. Post-processing could work, but it's probably not as reliable.

Is there a way to save the metadata alongside the downloaded images? If not, perhaps an extractor.*.metadata option which works like filename would be nice. Like before, this could probably be done manually with post-processing, I guess.

@mikf
Owner Author

mikf commented Oct 9, 2018

@taiyu-len

Having it output big ints as strings would be preferable.

output.num-to-str (48a8717), but it converts all numeric values to strings. Would it be preferable to only convert integers > 2**52 and < -2**52, i.e. anything a double can't handle without losing information, or is it OK like this?
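
For illustration of that bound, the point where doubles start dropping integer precision:

>>> 2**52 + 1 == int(float(2**52 + 1))
True
>>> 2**53 + 1 == int(float(2**53 + 1))
False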

does not have the same keywords available for use in the format string as filename does.

That's by design, because in most cases the extra file-specific metadata is only its filename and extension from the download URL and maybe its width and height, so nothing you would want/need to create an extra directory for. Instead of re-evaluating the entire format string and calling makedirs() for each file, I wanted to have it only done once and then put every file into that one target directory.

Is there a way to save the metadata alongside the downloaded images?

Not yet, but I'm working on something.

@taiyu-len

taiyu-len commented Oct 11, 2018

@mikf

is it OK like this?

Should work well; one can always convert the strings back to numbers if there's ever a need to operate on the values.

That's by design, because in most cases the extra file-specific metadata is only its filename and extension from the download URL and maybe its width and height, so nothing you would want/need to create an extra directory for.

In my case, for pixiv files, I use the illust_id and title for subdirectories, which is useful for multi-image galleries.
Checking the source, the keywords for the directory come from this section, but it only exposes the user section.

For the PixivWorkExtractor (haven't looked at the bookmark/favorite extractors yet), this change would make the rest of the Illust object available for use in the directory naming, though it does include some other junk like URLs; then again, the original version does too, with the profile picture URL. One could filter it out, or just leave it up to the user whether they want to include page_views or whatever dumb things in the directory name.

diff --git a/gallery_dl/extractor/pixiv.py b/gallery_dl/extractor/pixiv.py
index 0005f92..8491bf7 100644
--- a/gallery_dl/extractor/pixiv.py
+++ b/gallery_dl/extractor/pixiv.py
@@ -214,7 +214,7 @@ class PixivWorkExtractor(PixivExtractor):

     def get_metadata(self, user=None):
         self.work = self.api.illust_detail(self.illust_id)
-        return PixivExtractor.get_metadata(self, self.work["user"])
+        return self.work

Also, pixiv sending me an email every time I use this is a bit annoying.
Pixiv's help says it should recognize the connection after the first time, but it doesn't seem to be doing that. Maybe I need to use it more.

@AKluge

AKluge commented Oct 15, 2018

Hi, I have a question about the config file / filters. I want to exclude tags on pixiv and maybe other sites.

After running gallery-dl with the -K option, I get the tags[] array with \uXXXX char-codes, and when I try to run e.g.:
gallery-dl --filter "tags[-1] != \uXXXX\uXYYY..." URL (the ignored tag was the last in the array)
I get the warning:
[gallery-dl][warning] unexpected character after line continuation character (image filter, line 1)
and the ignored files are still downloaded

I tried other tags[]-related queries, but none worked. Can you please explain what I'm missing, or is it a bug? Anyway, thanks in advance.

Edit: I added one tag to the Mute list on pixiv and the is_muted attribute changed to true; the --filter "is_muted == False" ignored the muted files, but in the config-file the "image-filter":"is_muted == False" statement is still not working

@mikf
Owner Author

mikf commented Oct 16, 2018

@AKluge
You are missing quotation marks around the string you want to test against.
It should be "tags[-1] != '\uXXXX\uXYYY...'" , i.e. single quotes around the \uXXXX\uXYYY part.

If you want to test if a string is inside a list, you can use the in operator: "'foo' in tags"

--filter "is_muted == False" ignored the muted files, but in the config-file the "image-filter":"is_muted == False" statement is still not working

The command-line option does exactly the same as the image-filter config option (*), so maybe you did something wrong here, but what you posted looks alright.

I've been using this configuration to test and it works just fine:

{ "extractor": {
    "pixiv": {
        "image-filter": "is_muted == False",
        "username": "foo",
        "password": "bar"
    }
} }

I would also recommend "not is_muted" to test if something is false. This would also work if is_muted is None or an empty string/list.

(*) it effectively sets a global value for image-filter and therefore overwrites all image-filter entries from a config file

@taiyu-len

Also, pixiv sending me an email every time I use this is a bit annoying.

Shouldn't happen anymore: 8faf03e
I've never noticed this myself, since all email addresses for my Pixiv accounts are (now) invalid.

Concerning directory metadata, you can put a yield Message.Directory, work before any yield Message.Url, url, work to "fix" it, like so:

diff --git a/gallery_dl/extractor/pixiv.py b/gallery_dl/extractor/pixiv.py
index 115b1fb..8716a2d 100644
--- a/gallery_dl/extractor/pixiv.py
+++ b/gallery_dl/extractor/pixiv.py
@@ -31,7 +31,6 @@ class PixivExtractor(Extractor):
         metadata = self.get_metadata()

         yield Message.Version, 1
-        yield Message.Directory, metadata

@@ -55,11 +54,13 @@ class PixivExtractor(Extractor):
                     "_ugoira600x600", "_ugoira1920x1080")
                 work["frames"] = ugoira["frames"]
                 work["extension"] = "zip"
+                yield Message.Directory, work
                 yield Message.Url, url, work

             elif work["page_count"] == 1:
                 url = meta_single_page["original_image_url"]
                 work["extension"] = url.rpartition(".")[2]
+                yield Message.Directory, work
                 yield Message.Url, url, work

@@ -67,6 +68,7 @@ class PixivExtractor(Extractor):
                     url = img["image_urls"]["original"]
                     work["num"] = "_p{:02}".format(num)
                     work["extension"] = url.rpartition(".")[2]
+                    yield Message.Directory, work
                     yield Message.Url, url, work

But keep in mind that, as of right now, gallery-dl isn't optimized for this kind of thing. Usually, general metadata for directories and specialized metadata for filenames is more than enough.

@wankio
You can now set deviantart.original to "images" to only download original image files and fall back to the preview version otherwise (d8492df)

@wankio
Contributor

wankio commented Nov 17, 2018

sankaku: HTTP request failed: ('Connection broken: OSError("(10054, \'WSAECONNRESET\')")', OSError("(10054, 'WSAECONNRESET')"))

Why does this happen? I thought it would retry the download 10 times before giving up? Thanks.

@mikf
Owner Author

mikf commented Nov 18, 2018

Seems like Sankaku unexpectedly aborted your connection to its servers, and the underlying requests and urllib3 libraries reported it as a weird/unrecognized exception, so gallery-dl stopped its extraction.

It does retry an HTTP request (up to 10 times for Sankaku) if the error is reported as a ConnectionError or Timeout, but this appears to be a Windows-specific "connection reset" error I haven't seen before. In the meantime, just retry the same URL and hope it doesn't happen again ...

@reversebreak

I have FFmpeg on my PATH, but the pixiv downloader still just downloads ugoira as ZIP files and leaves them like that. I've looked through the options but can't figure out how to make them convert.
Could you assist?

@mikf
Owner Author

mikf commented Nov 29, 2018

You need to enable the ugoira post-processor module.

One way, to see if it works, is to use --ugoira-conv:

$ gallery-dl --ugoira-conv "https://www.pixiv.net/member_illust.php?mode=medium&illust_id=71828129"

If everything works as it should, you may want to set a permanent option with better encoding settings in your config file. A working example looks something like this:

{
    "extractor":
    {
        "postprocessors": [{
            "name": "ugoira",
            "whitelist": ["pixiv", "danbooru"],
            "extension": "webm",
            "ffmpeg-twopass": true,
            "ffmpeg-output": true,
            "ffmpeg-args": ["-c:v", "libvpx-vp9", "-an", "-b:v", "0", "-crf", "30"]
        }]
    }
}

See also postprocessors and ugoira options
There are also post-processor examples in docs/gallery-dl-example.conf

@wankio
Contributor

wankio commented Dec 14, 2018

How can I download specific months from tumblr.com/archive?

Because when I'm using gallery-dl on a big Tumblr blog (3-10 years old), it will hit the API limit 100% of the time, and if I change the API key, I have to rerun it from the start or with --range, and it will hit the limit again too.

TY

@mikf
Owner Author

mikf commented Dec 14, 2018

@wankio I don't think it is really possible to request posts from a specific month or time-frame using Tumblr's API (correct me if I'm wrong).
Aren't 5000 posts, or 25000 posts when using the total daily limit, enough, assuming you are using your own API key?
Maybe you could try only requesting one specific post type at a time, i.e. only photo posts on one day, text posts the next day, etc., to space out the total amount of API calls over a couple of days.

@wankio
Contributor

wankio commented Dec 15, 2018

Yes, my own API key. For now I'm using Tumbex to get post links automatically.

And can gallery-dl do the same as wget's --mirror?
TY

@mikf
Owner Author

mikf commented Jan 1, 2019

And can gallery-dl do the same as wget's --mirror?

No, it can't. There is recursive:<URL>, as you probably know, but that does something quite different. Just use wget itself (with the appropriate filters) if you need that sort of functionality.
