Questions, Feedback and Suggestions #11

Closed
Hrxn opened this issue Mar 29, 2017 · 82 comments

Comments

@Hrxn
Contributor

Hrxn commented Mar 29, 2017

A central place for these things might be a good idea.

This thread could serve as a starting point, results will eventually be collected in the project wiki, if appropriate and useful.

Edited 2017-04-15
For conciseness

Edited 2017-05-04
Removed nonsensical checklist thing


@mikf
Owner

mikf commented Mar 31, 2017

This is actually a really good idea, especially since I'm very hesitant/lazy about documenting things or writing text in general.

edit: The more I think about it, the less satisfied I am with the previous explanation, so here is version 2.

  • Extractors produce a collection of key-value pairs. To find out what these look like, use the --list-keywords option:
$ gallery-dl --list-keywords http://www.pixiv.net/member_illust.php?id=11
Keywords for directory names:
artist-id:   11
artist-name: pixiv事務局
artist-nick: pixiv
category:    pixiv
subcategory: user

Keywords for filenames:
age_limit:       all-age
artist-id:       11
artist-name:     pixiv事務局
artist-nick:     pixiv
book_style:      right_to_left
...
category:        pixiv
content_type:    None
created_time:    2017-03-31 13:50:53
extension:       jpg
favorite_id:     0
height:          865
id:              62178245
...
  • These key-value pairs are used to generate directory- and filenames by plugging them into format strings. For directories this is a list of format strings to work around the different path segment separators in Windows and UNIX systems (backslash \ or slash /).

  • Each extractor has a default format string for directory- and filenames. For pixiv this is

    directory_fmt = ["{category}", "{artist-id}-{artist-nick}"]
    filename_fmt = "{category}_{artist-id}_{id}{num}.{extension}"
  • The default values can be overwritten in your configuration file by setting the appropriate directory and filename values.
{
    "extractor":
    {
        "pixiv":
        {
            "directory": ["my pixiv images", "{artist-id}"],
            "filename": "{id}.{extension}"
        }
    }
}
  • The category of each extractor is a keyword supplied in every key-value pair collection. It can therefore be used in every format string and has been chosen to be the first segment of every default format string for directory names, but that can, as explained above, be changed.
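As a rough illustration of the mechanism described in the list above, this Python sketch plugs a keyword dictionary into the pixiv default format strings. The build_path helper and the empty "num" value are assumptions made for the example, not gallery-dl's actual code.

```python
import posixpath


def build_path(directory_fmt, filename_fmt, keywords, join=posixpath.join):
    """Format every directory segment and the filename, then join them."""
    segments = [fmt.format_map(keywords) for fmt in directory_fmt]
    filename = filename_fmt.format_map(keywords)
    return join(*segments, filename)


# Values taken from the --list-keywords output above; "num" is assumed
# to be empty here (hypothetical value for a single-image work).
keywords = {
    "category": "pixiv",
    "artist-id": 11,
    "artist-nick": "pixiv",
    "id": 62178245,
    "num": "",
    "extension": "jpg",
}

directory_fmt = ["{category}", "{artist-id}-{artist-nick}"]
filename_fmt = "{category}_{artist-id}_{id}{num}.{extension}"

print(build_path(directory_fmt, filename_fmt, keywords))
# pixiv/11-pixiv/pixiv_11_62178245.jpg
```

Passing join=ntpath.join instead would produce the same segments joined with backslashes, which is the reason the directory is a list of format strings rather than a single string.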

(edit end)

If something still doesn't make sense, just tell me and I will try to explain this a bit better.

@Hrxn
Contributor Author

Hrxn commented Apr 3, 2017

Very good to know, thank you.

Checked some profiles with --list-keywords; very useful, and it returns exactly what I expected. Everything according to plan, at least on the extraction side :)

I realized what caused the slight confusion (for me): The default format string set by the extractor gets overwritten by the output format defined in gallery-dl.conf, got that, all working as expected so far.

What put me a bit off was this:
https://github.com/mikf/gallery-dl/blob/master/gallery-dl.conf#L17-L30

Because pixiv seems to be a bit of a special case here.
Defining two different formats for directory, because pixiv makes use of two different "sub-extractors" ( for lack of a better word): "user": {..}, and "bookmark": {..}

I think these are called objects in JSON parlance..

Now, if I want to use my own directory and filename values in gallery-dl.conf,
along the lines of your given example:

{
    "extractor":
    {
        "pixiv":
        {
            "directory": ["my pixiv images", "{artist-id}"],
            "filename": "{id}.{extension}"
        }
    }
}

I put these two definitions into the "pixiv" object, that is, one level above the "user" and "bookmark" objects, right? This way, both definitions from each object get overwritten with the customized output format. A bit non-obvious, but this might just be me. And as long as it's working, nothing to complain about here ;-)

@mikf
Owner

mikf commented Apr 3, 2017

Because pixiv seems to be a bit of a special case here.

What you have discovered here is true for all extractors, especially those with more than one extractor per module, and not just pixiv. In general, the configuration value located "deepest" inside the dictionary or object tree is used. If none is found, the config system falls back to the default value.

An example:

{
    "extractor":
    {
        "pixiv":
        {
            "user": { "filename": "A" },
            "filename": "B"
        },
        "deviantart":
        {
            "image": { "filename": "C"}
        },
        "filename": "D"
    }
}

With a configuration file like the one above, the following is going to happen:

  • The pixiv.user extractor will use "A"
  • All other pixiv extractors will use "B"
  • The deviantart.image extractor will use "C"
  • All other extractors, including the other deviantart extractors, will use "D"
  • None will use their default value

I put these two definitions into the "pixiv" object, that is, one level above the "user" and "bookmark" objects, right? This way, both definitions from each object get overwritten with the customized output format.

Yes, if you have those two definitions at this place, then all pixiv extractors (there are 4 in total) will use these instead of their default format strings.

If you want to dig even deeper, take a look at the inner loop of the config.interpolate function. For example for the pixiv.user extractor this function gets called like so:

    directory = config.interpolate(["extractor", "pixiv", "user", "directory"], default)

This function first searches the top-most level for a value with key "directory" and stores this value if it finds it. It then descends into the "extractor" object and, again, searches this level for a value with key "directory". The same goes on with "pixiv" and "user" until it finally reaches the end.
If at any point something goes wrong and an exception gets thrown, which happens if for example the "pixiv" object doesn't exist, then the value stored up to this point gets returned.
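The lookup order described above can be sketched as follows. This is a minimal reimplementation for illustration only, not the actual config.interpolate: it uses explicit checks instead of exception handling, and the "bookmark" and "gallery" subcategories as well as the "default" string are just example inputs.

```python
def interpolate(config, path, default=None):
    """Return the value for path[-1] found deepest along path[:-1]."""
    *scopes, key = path
    value = default
    node = config
    # Check the top-most level first, then every level on the way down.
    if key in node:
        value = node[key]
    for scope in scopes:
        node = node.get(scope)
        if not isinstance(node, dict):
            break  # the path ends here; keep the value found so far
        if key in node:
            value = node[key]
    return value


# Same configuration as in the example above.
config = {
    "extractor": {
        "pixiv": {"user": {"filename": "A"}, "filename": "B"},
        "deviantart": {"image": {"filename": "C"}},
        "filename": "D",
    }
}

for category, subcategory in [("pixiv", "user"), ("pixiv", "bookmark"),
                              ("deviantart", "image"), ("deviantart", "gallery")]:
    path = ["extractor", category, subcategory, "filename"]
    print(category, subcategory, interpolate(config, path, "default"))
# pixiv user A
# pixiv bookmark B
# deviantart image C
# deviantart gallery D
```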

@Hrxn
Contributor Author

Hrxn commented Apr 3, 2017

Okay, got it. Also, found all 4 pixiv extractors ;-)

Very nice, and very flexible. Ultimately, every possible variant can be customized. Excellent.

Just threw some pixiv URLs at the program, can confirm everything works indeed as described! (Including these multiple images per entry/"work", I did a manual recount ;-)

On to the next one..

Okay, this probably is a newbie question, but it looks like exhentai isn't a real site? There is e-hentai, seems like they are related (sister sites?). And you apparently need an e-hentai account first (and some dark magic, probably) before you can use exhentai. I will read a bit into this first.

Pretty sure that is the first time I've ever encountered something like this.

But this theory has a little flaw: If these two sites are indeed related, I'd assume that they don't differ much on the technical side, if at all. But trying some gallery links from e-hentai got me this:

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047429/525823ef87/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047429/525823ef87/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047407/f00ba6d6cf/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047407/f00ba6d6cf/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047272/a003dfb22b/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047272/a003dfb22b/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047010/d8b62a3c87/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047010/d8b62a3c87/'

C:\Users\Hrxn>gallery-dl "https://e-hentai.org/g/1047424/0218b04f9c/"
[gallery-dl][error] No suitable extractor found for 'https://e-hentai.org/g/1047424/0218b04f9c/'

C:\Users\Hrxn>gallery-dl --version
0.8.1-dev

C:\Users\Hrxn>

Or is there another specific reason for this?

@mikf
Owner

mikf commented Apr 4, 2017

exhentai is basically the "dark" version of e-hentai with all the non-advertiser-friendly stuff enabled.

You should be able to access this site by doing this:

  • create an account on e-hentai.org
  • wait a week or so, use the regular site a bit, maybe play a bit of hentaiverse
  • clear all your exhentai.org, e-hentai.org, forums.e-hentai.org cookies
  • re-login on e-hentai
  • visit exhentai

In the past the domain of the regular site was g.e-hentai.org and I haven't updated the extractor to also accept e-hentai.org.
You can just change the URLs a bit and replace the - with an x or put a g. in front. It all falls back to the same code that relies on having access to exhentai.

@Hrxn
Contributor Author

Hrxn commented Apr 12, 2017

Okay, made an account, will frequent the site a bit and see how it works out then..

Can't test it before, because b603b59 changed the expression pattern, and that part works, but it's still the exhentai extractor and therefore requires credentials for authentication. Which is not really an issue, don't get me wrong.

I will test some other sites in the meantime, and will update my initial post accordingly.

@mikf
Owner

mikf commented Apr 13, 2017

I don't know if visiting the regular site and so on is even necessary, that is just what I did when I created an account for unit testing and couldn't access exhentai immediately.

Speaking of which: I didn't want to make my unit testing accounts any more public than necessary (for, i hope, obvious reasons), but I should probably just share them with you. Take a look at this.

@Hrxn
Contributor Author

Hrxn commented Apr 13, 2017

I don't know if visiting the regular site and so on is even necessary, that is just what I did when I created an account for unit testing and couldn't access exhentai immediately.

I'm not sure, but other random sources on the Internet indicate that this is actually the case.

Speaking of which: I didn't want to make my unit testing accounts any more public than necessary (for, i hope, obvious reasons), but I should probably just share them with you. [...]

Yes, obviously. That is nice, but it won't be necessary, I've already made an account and started using it a bit. Besides, creating and using different accounts for different sites and services doesn't really bother me at all. If there is some longer gap between my responses, it's only because I'm busy with something else ;-)
I use Keepass for handling this stuff, which is a really great program, as you probably know. It's so good, they should invent a new word for it (great cross-platform alternative: KeepassXC)


Another thing, which I think belongs here, because it's not an issue or bug, but maybe a possible suggestion:

There is another feature on DeviantArt I wasn't aware of before: The Journal.

I noticed it while using gallery-dl with this profile: http://inkveil-matter.deviantart.com/

The site states: 190 deviations.
gallery-dl download: 155 files.

Luckily, there is a statistics page which explains this:
http://inkveil-matter.deviantart.com/stats/gallery/

InkVeil-Matter has 93,840 pageviews in total; their 35 journals and the 155 deviations in their gallery were viewed 733,738 times.

35 Journal entries, so 190 in total.

Shamelessly copied from the DeviantArt Wikipedia page:

Journals are like personal blogs for the member pages, and the choice of topic is up to each member; some use it to talk about their personal or art-related lives, others use it to spread awareness or marshal support for a cause.

Not sure if that is useful at all. I clicked around a bit, and saw nothing I would consider as missing.
Embeds from their own gallery, or from any other, and some links to some drawing feature of DeviantArt I also didn't know of before: Muro, can be seen when visiting sta.sh for example, which also belongs to them as it seems.

I don't know, not sure if I even really understand this feature yet.

Anyway, forgive me my wall of text here, I just wanted to let you know, just in case this is news to you as well ;-)

@mikf
Owner

mikf commented Apr 17, 2017

I use Keepass for handling this stuff, which is a really great program, as you probably know. It's so good, they should invent a new word for it (great cross-platform alternative: KeepassXC)

Thank you for the suggestion but I'm going to stay with my trusty GPG-encrypted plain text file :)

Another thing, which I think belongs here, because it's not an issue or bug, but maybe a possible suggestion

Even if this platform here is called an issue tracker, feel free (and even encouraged) to create new "issues" if you want to suggest or request a feature or support for a new site.

There is another feature on DeviantArt I wasn't aware of before: The Journal.

This seems to be just a collection of blog posts, which might contain references to other deviantart- or sta.sh images. There shouldn't be any images missing: 190 deviations consisting of 155 real deviations and 35 journal entries seems about right to me.
I could add an extractor to fetch those references and download all the images of a journal entry if you want me to.

Anyway, forgive me my wall of text here, I just wanted to let you know, just in case this is news to you as well ;-)

No worries, I don't mind walls of text and actually wasn't aware of the journal or muro, so thanks for telling me.

@ghost

ghost commented May 1, 2017

Sorry for asking, but does this tool have a feature to remember which images have already been downloaded, without checking the local directory? IIRC the package already includes the SQLite DLL, right?

Thank you.

@mikf
Owner

mikf commented May 1, 2017

No, I am sorry, but such a feature does currently not exist.
gallery-dl only skips downloads if a file with the same name already exists, but there is at this time no other way of "remembering" if an image has been downloaded before.
SQLite, as you have noted, is already being used, but that is only to cache login sessions and the like across separate gallery-dl invocations.

Feel free to open a separate issue if you want a feature like this to be implemented, but please explain in greater detail what you actually want to do and/or need this feature for.

@Hrxn
Contributor Author

Hrxn commented May 4, 2017

Just saw the new commit adding options for skipping files.

A change from fc9223c#diff-283aceda91c5f7f10981253611f9f950

    def _exists_abort(self):
        if self.has_extension and os.path.exists(self.realpath):
            raise exception.StopExtraction()
        return False

Current extractor run, in this context, means just the 'active' URL, right?
Because I'm not sure yet what the expected behaviour would be if gallery-dl is used like this:
gallery-dl --input-file FILE

Maybe a case for an additional option. Or rather not, I'm still not sure about it, need to make up my mind first probably.

@mikf
Owner

mikf commented May 4, 2017

Current extractor run, in this context, means just the 'active' URL, right?

Yes.
Each URL gets its own extractor, so the --abort-on-skip option works for each URL independently. Aborting the run of one URL has no effect on any other URLs.

Because I'm not sure yet what the expected behaviour would be if gallery-dl is used like this

The -i/--input-file FILE option just appends the URLs inside of FILE to the end of the list of all URLs.
gallery-dl -i FILE URL1 is equivalent to gallery-dl URL1 URL2 URL3 if FILE contains URL2 and URL3.
Even if, for example, the download for URL1 gets canceled, URL2 and URL3 will still be processed normally.
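A small Python sketch of the behaviour described above, assuming a hypothetical process callback that stands in for one extractor run. Only the StopExtraction name comes from the code quoted earlier; collect_urls and run_all are made up for this illustration.

```python
class StopExtraction(Exception):
    """Signals that the run for the current URL should be aborted."""


def collect_urls(cli_urls, file_lines):
    """Append the URLs read from an input file to the command-line URLs."""
    urls = list(cli_urls)
    urls.extend(line.strip() for line in file_lines if line.strip())
    return urls


def run_all(urls, process):
    """Process each URL independently; an abort only affects that URL."""
    finished = []
    for url in urls:
        try:
            process(url)
        except StopExtraction:
            continue  # skip the rest of this URL, carry on with the next
        finished.append(url)
    return finished


def process(url):
    # Hypothetical extractor run: pretend URL1 hits an existing file.
    if url == "URL1":
        raise StopExtraction()


urls = collect_urls(["URL1"], ["URL2\n", "URL3\n"])
print(urls)                    # ['URL1', 'URL2', 'URL3']
print(run_all(urls, process))  # ['URL2', 'URL3']
```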

Maybe a case for an additional option

An --exit-on-skip option that just exits the program on any download-skip would certainly be possible.

@Hrxn
Contributor Author

Hrxn commented May 4, 2017

An --exit-on-skip option that just exits the program on any download-skip would certainly be possible.

Yes, for example. I think the current behaviour is just right as the default, we'll see when someone asks for other variants.

@HASTJI

HASTJI commented Jun 4, 2017

Do you plan to add a graphical interface for the program? At least input fields and pause/continue buttons. I'm also interested in multi-threading and in the possibility to plan the uploads one by one via the GUI. Yes, I know that it can be done through the console, but still ...

@Hrxn
Contributor Author

Hrxn commented Jun 5, 2017

Well, I don't know, but if I may, let me add just this:
I wish people would realize how much programming work implementing a GUI actually is. And the thing is, that means actual code, lots of lines of code, only for the GUI, and this gets never used outside of the GUI again. So this is just additional work on the top, without any benefit for the actual underlying code.

@mikf
Owner

mikf commented Jun 5, 2017

No, there are no plans for a graphical user interface, mainly because of the reasons @Hrxn listed.
A lot of the features you mentioned can already be done via a (reasonable) terminal and shell plus the (GNU) coreutils that usually come with them and I don't really want to re-implement this.
I do realize that the CLI "experience" on Windows is terrible, so maybe, if there is a big enough demand, I might add some sort of GUI in the future, but that will always be low priority.

@HASTJI

HASTJI commented Jun 5, 2017

No, there are no plans for a graphical user interface, mainly because of the reasons @Hrxn listed.

Hmmm... I can try to build a graphical shell in C# for the Windows version that will pass commands to and from gallery-dl, but I'm not sure it will only take a little time. But I will try my best.

@Hrxn
Contributor Author

Hrxn commented Jun 5, 2017

Yes, I think that's a good idea.
Also, in my opinion, using the CLI on Windows isn't too bad. For many use cases, standard batch scripts (*.bat/*.cmd) should be enough, for example starting gallery-dl with multiple/dozens/hundreds of URLs, and if you need plenty of scripting capabilities, you can use gallery-dl within PowerShell.

I wouldn't even know what to use a GUI program for, to be honest. If the program is running, there isn't much to see, because what actually takes the most time is just transferring data across the net, aka downloading. You could add some fancy progress bars, but this doesn't really change anything, in my opinion. Besides, progress bar support can also be done in CLI, via simple text output written to the terminal, like wget and curl for example.

The only thing I can really think of right now is managing your personal usage history of the program, so to speak. That means having all in one central place, a queue for all URLs that are yet to be processed, and an archive of all URLs that already have been done. This would be more of a meta-program, if you think about it, because all this can be done completely independent of gallery-dl. You could also use this program to write the script files for the CLI then 😄

As a starting point, writing processed URLs to archive file(s) would be a good idea, I think. Something along the lines of the --download-archive option of youtube-dl, for example.

@Hrxn
Contributor Author

Hrxn commented Jun 12, 2017

Interesting, although these sites seem so similar (and the Gelbooru site even states "concept by Danbooru"), they are yet so different in terms of implementation and functionality.

I just checked again, Gelbooru support for pools may be pretty much irrelevant, at least for now. Unlike on Danbooru, where pools are used quite extensively, Gelbooru only seems to have 25 pools in total right now, and there is not really much activity. At least that is what I see here, even with an account on Gelbooru. An account does, however, let you create your own pools (public and private) and collect different posts there, which could then be downloaded. So this might be relevant to potential gallery-dl users, maybe..

@mikf
Owner

mikf commented Jun 12, 2017

Gelbooru only seems to have 25 pools in total right now

There seem to be up to 44500 pools if you take a look at the id parameter of one of the pool URLs, but the pagination controls for gelbooru's pool and tag pages seem to be missing. You can get to the next pool page by setting the pid parameter in the URL (the same mechanism is used on their posts page): page2 page3 and so on.

@Hrxn
Contributor Author

Hrxn commented Jun 12, 2017

What made you check that? Is 25 inconceivably low for a big site? 😉
But you're right, of course. Incidentally, I found the pagination on Gelbooru! It was blocked by uBlock Origin, which I use on Chrome. Well, not just on Chrome, I use it wherever I can, actually.
That means that some entry in one of the filter lists breaks the site...
Edit:
Not sure what tag page exactly, but apart from pools pagination seems to work for me.

@mikf
Owner

mikf commented Jun 12, 2017

This one: https://gelbooru.com/index.php?page=tags&s=list

AdblockPlus + filter list seems to be causing the same issue.

@Hrxn
Contributor Author

Hrxn commented Jun 12, 2017

Ah, okay. Yes, pagination also broken for me on that listing.

@Hrxn
Contributor Author

Hrxn commented Jun 15, 2017

Small suggestion:

Add a column to Supported Sites to indicate status of user authentication. Not sure, just a simple "Supported" Yes or No.
Or "Required", Yes or No. Or "Required", "Optional"...

By the way, does this case even exist currently?
We have extractors that require authentication: Pixiv, Exhentai, and some others
And other extractors that don't authenticate at all, right?

@mikf
Owner

mikf commented Jun 15, 2017

Thanks for the suggestion -> done fb1904d

And yes, there are two modules with optional authentication: bato.to and exhentai.
bato.to only offers a very limited selection of manga chapters if you are not logged in, but it is still usable.
Exhentai tries to fall back to the e-hentai version of the site, which only works for some galleries, and original image downloads aren't available as well.
(I added the fallback mechanism for exhentai only after our discussion, btw. af56887)

@Hrxn
Contributor Author

Hrxn commented Aug 26, 2017

Might be a good idea, I'll try that. They are probably not exactly in sync when doing their archiving, but that is not really a problem.

By the way, warosu seems to be an archive as well, does board-id/thread-id also match here?

@mikf
Owner

mikf commented Aug 26, 2017

Seems that way. The /g/ threads on warosu even link to their counterparts on archived.moe and rbt.asia, and they all share the same thread-id.

@llelf

llelf commented Oct 10, 2017

There’s a good deal of keyword∶value info, 👍 on that.
It’s a pity it all will vanish, because currently there’s no way to save it somewhere [xattrs!]
What do you all think?

@Hrxn
Contributor Author

Hrxn commented Oct 12, 2017

Question regarding f3fbaa5

[reddit] allow users to override the API User-Agent

For setting extractor.reddit.user-agent
Maybe it's just me, but I think the rules they list contradict themselves a bit.
Not sure what they really expect. But I think the important part is not to pretend to be a browser.. 😄
Or lie in any other way..
So what would you suggest? Just follow the given example
Example: User-Agent: android:com.example.myredditapp:v1.2.3 (by /u/kemitche)
and emulate that a bit, basically?

@mikf
Owner

mikf commented Oct 12, 2017

@llelf you can currently store any metadata in JSON format by passing -j and redirecting the output to a file, but that doesn't work very well and also doesn't download any images while doing that. Using xattrs seems like a good idea so I'll be looking into that.

@Hrxn they state that every user-agent string should look something like <platform>:<app ID>:<version string> (by /u/<reddit username>), which is currently set to Python:gallery-dl:0.8.4 (by /u/mikf1). Take this and replace gallery-dl with the name of your registered application and mikf1 with your own username, but just modifying the given example a bit should work as well.
I'm not sure how strict they are about all of this ("NEVER lie about your user-agent" is written in bold ...) but I wanted to avoid a situation where multiple "applications" use the same user-agent as gallery-dl and they block all of them.
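For illustration, the pattern quoted above can be assembled like this in Python. The build_user_agent helper, the application name "my-app" and the username "my_username" are placeholders for this sketch; only the option name extractor.reddit.user-agent and the gallery-dl default string come from this thread.

```python
import json


def build_user_agent(platform, app_id, version, username):
    """Build '<platform>:<app ID>:<version string> (by /u/<reddit username>)'."""
    return "{}:{}:{} (by /u/{})".format(platform, app_id, version, username)


# Reproduces the default mentioned above.
print(build_user_agent("Python", "gallery-dl", "0.8.4", "mikf1"))
# Python:gallery-dl:0.8.4 (by /u/mikf1)

# A config fragment with placeholder values for a self-registered application.
config = {
    "extractor": {
        "reddit": {
            "user-agent": build_user_agent("Python", "my-app", "1.0", "my_username")
        }
    }
}
print(json.dumps(config, indent=4))
```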

@ezagarskaya

ezagarskaya commented Nov 30, 2017

Guys, first thanks for a great project, it helps me a lot!

The question is:
How to add a delay between download requests?
My speed is too high, I am afraid safebooru will block me soon.

I have used
"safebooru":
{
    "wait-min": 6,
    "wait-max": 10,
    "timeout": 30,
    "filename": "{id}.{extension}"
},
but nothing changed

Please, help me

@mikf
Owner

mikf commented Nov 30, 2017

There is currently no way to add a delay between downloads or limit download speeds, but I guess I will be looking into that next.

wait-min and wait-max are only available for exhentai and chan.sankakucomplex, because they would either actively block you or respond with "429 Too Many Requests" status codes if you didn't wait between requests to their sites, but so far these have been the only two where this was necessary (I doubt safebooru is going to block you).

In the meantime you could collect a few image URLs from safebooru by using -g and use another program that supports these features (aria2, wget, etc) to download them:

# get the first 500 image URLs and download them at 500kb/s, waiting 5s after each download
$ gallery-dl -g --range 1-500 "http://safebooru.org/..." > url_file
$ wget -i url_file --limit-rate=500k --wait=5

# get the next 500
$ gallery-dl -g --range 501-1000 "http://safebooru.org/..." > url_file
...

(The timeout option only works for the HTTP downloader and has a default value of 30, so setting it there doesn't do much)
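If wget or aria2 are not at hand, the same workaround could be sketched with the Python standard library. This is only an assumption-laden example, not a gallery-dl feature: fetch_to_dir and download_with_delay are hypothetical helpers, the file name is naively taken from the last URL segment, and url_file is the file written by gallery-dl -g above.

```python
import time
import urllib.request
from pathlib import Path


def fetch_to_dir(url, directory="."):
    """Download one URL into 'directory', named after its last path segment."""
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
    target = Path(directory) / name
    with urllib.request.urlopen(url, timeout=30) as response:
        target.write_bytes(response.read())
    return target


def download_with_delay(urls, wait=5.0, fetch=fetch_to_dir, sleep=time.sleep):
    """Fetch the URLs one by one, sleeping 'wait' seconds between downloads."""
    results = []
    for index, url in enumerate(urls):
        if index:
            sleep(wait)  # no delay before the very first download
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # One URL per line, as produced by: gallery-dl -g ... > url_file
    url_list = Path("url_file").read_text().split()
    download_with_delay(url_list, wait=5.0)
```

Unlike the wget call above, this sketch does not limit the transfer rate; it only spaces out the requests.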

@Hrxn
Contributor Author

Hrxn commented Dec 7, 2017

@mikf Some extractors don't specify directory_fmt in their source (for example gfycat.py)

What to do? Manually setting another value for category?
I.e., this one: extractor.gfycat.category?
Because that is apparently the default directory that gets always used.
Or is it better to use this?
extractor.gfycat.directory

Which allows to use some sub-dirs, i.e. ["Gfy", "In", "Here"]?

@mikf
Owner

mikf commented Dec 7, 2017

If you want to change an extractor's target directory, you should set its directory value (here extractor.gfycat.directory).

(Extractor) classes will use the values specified in their base class if these aren't specified in the class itself, which in this case means that gfycat extractors are using the value set in the Extractor class (see Extractor.directory_fmt).
There is nothing special about not specifying a directory_fmt value. All it does is basically saving 1 line of code.

It is also not possible to overwrite an extractor's category. extractor.gfycat.category is not a value that gets recognized.
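The inheritance behaviour described above is plain Python class-attribute lookup. A minimal sketch, in which the class bodies, the base-class default value and the resolve_directory helper are assumptions for illustration rather than gallery-dl's real definitions:

```python
class Extractor:
    category = ""
    # Assumed default for this sketch; see Extractor.directory_fmt
    # in the source for the real value.
    directory_fmt = ["{category}"]


class GfycatExtractor(Extractor):
    category = "gfycat"
    # No directory_fmt here: the attribute is inherited from Extractor.


def resolve_directory(extractor_cls, config):
    """Use extractor.<category>.directory if set, else the class default."""
    section = config.get("extractor", {}).get(extractor_cls.category, {})
    return section.get("directory", extractor_cls.directory_fmt)


print(GfycatExtractor.directory_fmt)
# ['{category}']

config = {"extractor": {"gfycat": {"directory": ["Gfy", "In", "Here"]}}}
print(resolve_directory(GfycatExtractor, config))
# ['Gfy', 'In', 'Here']

# A "category" entry is simply never looked up, so it changes nothing.
ignored = {"extractor": {"gfycat": {"category": "something-else"}}}
print(resolve_directory(GfycatExtractor, ignored))
# ['{category}']
```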

@Hrxn
Contributor Author

Hrxn commented Dec 7, 2017

Thanks, got it.
Made some new targets for some directory prefs, can confirm, all seems to work fine! 😄

@Hrxn
Contributor Author

Hrxn commented Dec 21, 2017

@mikf There's some unusual behaviour, although I don't think it's a real issue, maybe a cosmetic one, and I assume something like this is specific to Windows as well. I hope it's not too much of a nitpick, probably just a question of different ways to implement it in detail..

For each directory option (extractor.*.directory) we can set a list of strings to specify a target directory for the extraction process, where each string in this list results in its own path segment.
This happens by using Python format strings, and by virtue of Python's excellent cross-platform support (at least that's what they say, right?), defining a target directory like this:
["Extractor", "Example", "Subdir", "{title}"]
Will give us the following result:

  • On Windows:
    • \Extractor\Example\Subdir\{title}
  • On Unixoid OS:
    • /Extractor/Example/Subdir/{title}

But here's the thing: It does not work in the same way for the base directory.
Consider this as my value for base-directory in gallery-dl.conf:
"D:/Download/Pictures"
What happens now, when using the extractor from the example here, the output messages printed to the console window appear like this (again, Windows):

D:/Download/Pictures\Extractor\Example\Subdir\{title}\filename_id_1.ext
D:/Download/Pictures\Extractor\Example\Subdir\{title}\filename_id_2.ext
(and so on)

Alternatively, setting base-directory to this:
"D:\Download\Pictures"
Results in an error message, improperly escaped sequence etc. pp.
This is maybe not really a surprise, considering that \ is usually a standard escape character.
Understandably, setting base-directory to this:
"D:\\Download\\Pictures"
seems to work then, giving output messages like this:

D:\Download\Pictures\Extractor\Example\Subdir\{title}\filename_id_1.ext
D:\Download\Pictures\Extractor\Example\Subdir\{title}\filename_id_2.ext
(and so on)
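The three base-directory spellings above behave the way JSON string escaping predicts. A short Python check, assuming the configuration file is parsed as standard JSON; load_base_directory is a hypothetical helper written for this example:

```python
import json


def load_base_directory(text):
    """Parse a JSON config snippet and return its base-directory value."""
    return json.loads(text)["base-directory"]


# Raw strings keep the backslashes exactly as they would appear in the file.
forward = r'{"base-directory": "D:/Download/Pictures"}'
escaped = r'{"base-directory": "D:\\Download\\Pictures"}'
unescaped = r'{"base-directory": "D:\Download\Pictures"}'

print(load_base_directory(forward))  # D:/Download/Pictures
print(load_base_directory(escaped))  # D:\Download\Pictures

try:
    load_base_directory(unescaped)
except json.JSONDecodeError as exc:
    # "\D" and "\P" are not valid JSON escape sequences.
    print("invalid config:", exc.msg)
```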

Okay, so it appears that, and please correct me if my conclusion is wrong, the base-directory property does not utilize the same Python format string as the directory options. Is there any specific reason for that?
I'm not sure, but I just assumed that all parts rely on the same format string, which then gets joined together to the final output format string, and that is the end result we see.

I did a quick code search, I think this is the relevant result:

bdir = extractor.config("base-directory", (".", "gallery-dl"))
if not isinstance(bdir, str):
    bdir = os.path.join(*bdir)
self.basedirectory = expand_path(bdir)

Or maybe these functions?

def build_path(self, sep=os.path.sep):

def set_directory(self, keywords):

@mikf
Owner

mikf commented Dec 21, 2017

The value for base-directory is supposed to be just a static string that gets put in front of all paths generated during runtime. Its environment variables get expanded, but it doesn't go through any string formatting and its path separators (/, \) are left alone by os.path.join.

The full path gets built by something like
os.path.join(base_directory, format(segment1), format(segment2), ..., format(filename))
which concatenates all parts using either / or \ depending on your OS, but anything inside these parts stays the way it is. So if you put any forward slashes into your base directory, they will still be there afterwards.

You can actually use a list of strings as directory segments for base-directory, which will be joined with the "correct" slashes, but thanks to how os.path.join works, you would still have to manually put a slash after the drive letter: ["D:\\", "Download", "Pictures"]. So that doesn't really help ...

As for a reason why it works the way it does: In the earlier versions of this project I wanted a way to direct all downloads to a common base-path which is how this option came to be; and it has stayed like this ever since. There is a static part + a dynamic part + a filename, which seems reasonable to me.

To solve your "slash" problem: I guess I could just replace all forward- with backward-slashes on Windows which should result in a consistent use of \ as path separator. (edit: d241a0f)
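The os.path.join behaviour described above can be reproduced on any OS with ntpath, the Windows flavour of os.path. This is a sketch under assumptions: join_windows and normalize_separators are hypothetical helpers, and the slash replacement only mirrors the idea mentioned above, not necessarily what the commit actually does.

```python
import ntpath  # Windows implementation of os.path, importable everywhere


def join_windows(base_directory, *segments):
    """Join a base directory (string or list of strings) with path segments."""
    if not isinstance(base_directory, str):
        base_directory = ntpath.join(*base_directory)
    return ntpath.join(base_directory, *segments)


def normalize_separators(path):
    """Replace forward slashes with backslashes, as suggested above."""
    return path.replace("/", "\\")


mixed = join_windows("D:/Download/Pictures", "Extractor", "Example")
print(mixed)                        # D:/Download/Pictures\Extractor\Example
print(normalize_separators(mixed))  # D:\Download\Pictures\Extractor\Example

# A bare drive letter gets no separator added after it ...
print(join_windows(["D:", "Download", "Pictures"], "Extractor"))
# D:Download\Pictures\Extractor

# ... so the slash has to be written out in the first list element.
print(join_windows(["D:\\", "Download", "Pictures"], "Extractor"))
# D:\Download\Pictures\Extractor
```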

@Hrxn
Contributor Author

Hrxn commented Dec 22, 2017

Interesting to know, thank you for the explanation.

In summary, we could say the true cause of this "issue" is the Python interpreter and its implementation itself, right? Depending on the OS, of course, but apparently the functionality of os.path etc. just takes any basic string and doesn't bother further. I assume that Python (on Windows) itself then uses some standard Windows API function for the output directory, and the Windows API doesn't care either about proper path separators, if I recall that correctly. In the end, I guess we can only speculate whether this is all a design decision or simply a small lapse. But okay, I digress..

Thank you anyway for addressing this very specific nuisance.. 😄

But with the latest commit, what is the one true way to write my gallery-dl.conf?
Or does it really matter, because the path separators now always get replaced, either way?

@mikf
Owner

mikf commented Dec 22, 2017

we could say the true cause of this "issue" is the Python interpreter and its implementation itself, right?

Well, not really. The functionality is well documented, so I could have somehow worked around this, but I didn't realize that forward slashes in Windows could be an issue ... doesn't help that I'm not using Windows myself.

But with the latest commit, what is the one true way to write my gallery-dl.conf?

As you said yourself, it doesn't really matter. Both work (/ or \\), so just use what looks best.

@Hrxn
Contributor Author

Hrxn commented Jan 1, 2018

@mikf
Happy New Year! 🍾 🎆 🎇

If I may inquire, are you currently planning on adding support for some new sites? Or already something in the pipeline? Other plans in that regard?

Because I'd like to make a suggestion, basically, and maybe get some other opinions and feedback in here 😄

@mikf
Owner

mikf commented Jan 1, 2018

Happy New Year to you, too.

There are no plans on adding support for new sites from my side, but I have been thinking about adding a few features - an equivalent of YoutubeDL's --download-archive and (maybe) a way of executing external processes after each image download (post-processing, writing metadata, etc.) - as well as finally adding some necessities like GitHub issue templates and a contributing guide.

If you have an idea or suggestion about improvements, (new) features, site support, etc., just open a new issue and let me know.

@Hrxn
Contributor Author

Hrxn commented Jan 2, 2018

I presume that something like --download-archive would be a useful feature, agreed. Good idea, actually.

Not sure if templates for GitHub are really that necessary, considering the rather low amount of opened issues. If the tracker gets flooded with new issues, this would be a different story. But if you think that the repository would feel like something's missing, for lack of a better description right now, don't mind my comment on this 😉

I will definitely open a new issue for a new site, but I wanted to gather some feedback first, and since this thread is already in existence [1], I thought it would be a good idea to simply ask first. Dunno, I would really like to see some other users chiming in here, but so far there aren't that many, unfortunately.

Okay, everyone reading this, please let me know: What do you think of adding support for ArtStation, for example?

[1]
Although I admit, I am not too happy about it. Because, technically, this is not a real issue, rather a "meta-issue", and this rubs my OCD in the wrong way, because it goes a bit against the principles of consistency and purity, and is kind of a conceptual issue in itself 😄
But I don't know what would work better instead right now. I think something like a #gallery-dl channel on IRC would be nice to have, and I would totally come and hang out there, but off-site solutions are usually less than ideal solutions.

Maybe this Projects feature on GitHub would be a good alternative?
This one here: https://github.com/mikf/gallery-dl/projects
Maybe some kind of Note can be opened, as a quick stop for any kind of discussion or something, not sure.

@Hrxn
Contributor Author

Hrxn commented Jan 6, 2018

Anyone? Please?

@Bfgeshka

Bfgeshka commented Jan 6, 2018

The functions covered by GH Projects and the issue tracker are virtually the same. The only important factor here is the personal preference of the main maintainer. I think the common tracker is much more straightforward.

@mikf
Owner

mikf commented Jan 6, 2018

The projects page doesn't seem particularly suited to fulfill a similar role as this meta-issue here does. Having an issue for general discussion is a lot more accessible/visible than a "meta-project" on the projects page and, as Bfgeshka said, much more straightforward for the average user.

But you are right, there should probably be another way and place for general questions and discussion. An IRC channel (on freenode?) would be nice and all, but it would most likely require some sort of logging bot to be useful. Another alternative might be Gitter, which is used by quite a few other GitHub projects. I've played around with it a bit and registered a "community" and room there: https://gitter.im/gallery-dl/main . Maybe that is something to use.

@Hrxn
Contributor Author

Hrxn commented Jan 7, 2018

This Gitter thing is pretty nice.. especially the integration with GitHub, definite advantage over a normal vanilla IRC channel.

As I understand it, the projects feature offers better visualization and organization of all related matters, in the form of boards, kanban style. I personally like these, but it might take some time to get used to it for any novices, and at the current state of the project in general, primarily activity, it might be a bit overkill right now. And you are right, accessibility and visibility should be the main concern here. I mean, any board/notes whatever in Project can be mentioned (and linked) in README.md, thus appearing directly on the "front" page, but on the other hand, the majority of users on GitHub is already familiar with the Issues tab, and that is therefore the place where they go/search first, I assume.

In the meantime, the meta-issue is definitely fine with me, no complaints here. Although on my end, not sure if you are affected as well, I can notice a small delay when opening this issue, it's not slow or anything, but noticeable, in my opinion. And as #11 here continues to grow, I guess at some point we'd have to close it and open a new one 😄

But okay, I think we're already in bike-shedding territory here.
So, what do you think of ArtStation: 👍 or 👎

@rachmadaniHaryono
Contributor

rachmadaniHaryono commented Jan 22, 2018

@mikf can you recommend a way to cache the result of the extractor?

  1. Can you explain the message types in https://github.com/mikf/gallery-dl/blob/master/gallery_dl/extractor/message.py? What should the keywords look like, and how does gallery-dl handle each type of message?

  2. I tried the gist you wrote:

j = job.UrlJob("http://example.org/image")
j.run()  # prints "http://example.org/img.jpg"
print(j.extractor)

This takes a long time for something like a link to a Reddit thread, where it finds other links and extracts them directly. So I'm trying a custom `UrlJob`, which handles messages of type Message.Queue the same way as Message.Url.

class CustomUrlJob(job.UrlJob):

    def run(self):
        try:
            log = self.extractor.log
            for msg in self.extractor:
                if msg[0] == Message.Queue:
                    _, url, kwds = msg
                    self.update_kwdict(kwds)
                    self.handle_url(url, kwds)
                else:
                    self.dispatch(msg)
                ...

Is there a better way to do it?

@mikf
Owner

mikf commented Jan 22, 2018

Caching extractor results (and a bit more) is what the DataJob class does, but you can have this a lot easier than that.
Extractor results are just tuples where the first element is one of these message-type identifiers from message.py which determines the type and meaning of the other elements.

  • Message.Version:
    • currently has no use, just ignore it.
  • Message.Directory:
    • sets the target directory for all following images
    • 2nd element is a dictionary containing general metadata
  • Message.Url:
    • image URL and its metadata
    • 2nd element is the URL as a string
    • 3rd element is a dictionary with image-specific metadata
  • Message.Urllist:
    • same as Message.Url, but its 2nd element is a list of multiple URLs (it is easier to have this as a separate message, if you are wondering)
  • Message.Queue:
    • (external) URL that should be handled by another extractor. The name "Queue" is a bit weird/misleading, but older versions implemented this with an actual queue.
    • 2nd element is the (external) URL
    • 3rd element is a dictionary that may contain metadata
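As a rough illustration of the tuple layout listed above, here is a self-contained sketch that sorts such messages by their first element. The `Message` class is a stand-in with arbitrary integer values (the real constants live in message.py), and the `collect` function and example URLs are hypothetical.

```python
class Message:
    # Stand-in constants for illustration only; the real values are
    # defined in gallery_dl/extractor/message.py and may differ.
    Version = 1
    Directory = 2
    Url = 3
    Urllist = 4
    Queue = 5


def collect(messages):
    """Sort extractor-style message tuples by their type identifier."""
    result = {"directory": None, "urls": [], "queued": []}
    for msg in messages:
        kind = msg[0]
        if kind == Message.Directory:
            # 2nd element: general metadata for all following images
            result["directory"] = msg[1]
        elif kind == Message.Url:
            # 2nd element: a single image URL
            result["urls"].append(msg[1])
        elif kind == Message.Urllist:
            # 2nd element: a list of URLs
            result["urls"].extend(msg[1])
        elif kind == Message.Queue:
            # 2nd element: URL meant for another extractor
            result["queued"].append(msg[1])
        # Message.Version carries nothing of interest and is ignored
    return result


# Hypothetical message stream, shaped like the description above.
example = [
    (Message.Version, 1),
    (Message.Directory, {"category": "example"}),
    (Message.Url, "http://example.org/img1.jpg", {"id": 1}),
    (Message.Urllist, ["http://example.org/img2.jpg",
                       "http://mirror.example.org/img2.jpg"], {"id": 2}),
    (Message.Queue, "http://example.com/gallery", {}),
]
collected = collect(example)
print(collected)
```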

To just copy all of these tuples for later use, try this: https://gist.github.com/mikf/052916c25a9bda7d6876a355cacbe88f

And the UrlJob thing is a bit of a mistake on my part and will be fixed in one of the next commits. For the time being, set UrlJob.maxdepth to 1 and it should pass Queue messages to its handle_url() method.

edit: updated the gist code to use extend() instead of append()

@mikf
Owner

mikf commented Jan 22, 2018

@Hrxn: before I forget, I'm also noticing a considerable delay when opening this issue, so closing this and creating a new one might be in order.
ArtStation gets a 👍 from me, but I would like to have this a separate issue with example URLs and all that.

@Hrxn
Contributor Author

Hrxn commented Jan 22, 2018

Roger that, closing this and opening issue for new site soon.
👍
