
Questions, Feedback and Suggestions #3 #146

Closed
mikf opened this issue Jan 1, 2019 · 270 comments

Comments

@mikf
Owner

mikf commented Jan 1, 2019

Continuation of the old issue as a central place for any sort of question or suggestion not deserving their own separate issue. There is also https://gitter.im/gallery-dl/main if that seems more appropriate.

Links to older issues: #11, #74

@rachmadaniHaryono
Contributor

rachmadaniHaryono commented Jan 1, 2019

a simple snippet to turn gallery-dl into an api:

from types import SimpleNamespace
from unittest.mock import patch

import click
from flask import Flask, jsonify, request
from flask.cli import FlaskGroup

from gallery_dl import main, option
from gallery_dl.job import DataJob

def get_json():
    data = None
    parser = option.build_parser()
    args = parser.parse_args()
    args.urls = request.args.getlist('url')
    if not args.urls:
        return jsonify({'error': 'No url(s)'})
    args.list_data = True

    class CustomClass:
        data = []

        def run(self):
            dj = DataJob(*self.data_job_args, **self.data_job_kwargs)
            dj.run()
            self.data.append({
                'args': self.data_job_args,
                "kwargs": self.data_job_kwargs,
                'data': dj.data
            })

        def DataJob(self, *args, **kwargs):
            self.data_job_args = args
            self.data_job_kwargs = kwargs
            retval = SimpleNamespace()
            retval.run = self.run
            return retval

    c1 = CustomClass()
    with patch('gallery_dl.option.build_parser') as m_bp, \
            patch('gallery_dl.job.DataJob', side_effect=c1.DataJob) as m_jt:
        m_bp.return_value.parse_args.return_value = args
        m_jt.__name__ = 'DataJob'
        main()
        data = c1.data
    return jsonify({'data': data, 'urls': args.urls})

def create_app(script_info=None):
    """create app."""
    app = Flask(__name__)
    app.add_url_rule(
        '/api/json', 'gallery_dl_json', get_json)
    return app


@click.group(cls=FlaskGroup, create_app=create_app)
def cli():
    """This is a script for application."""
    pass


if __name__ == '__main__':
    cli()

e: this could be simpler by using DataJob directly to handle the urls, but i haven't checked if anything has to be done before initializing a DataJob instance

@mikf
Owner Author

mikf commented Jan 3, 2019

this could be simpler by using DataJob directly to handle the urls, but i haven't checked if anything has to be done before initializing a DataJob instance.

You don't need to do anything before initializing any of the Job classes:

>>> from gallery_dl import job
>>> j = job.DataJob("https://imgur.com/0gybAXR")
>>> j.run()
[ ... ]

You can initialize anything logging related if you want logging output,
or call config.load() and config.set(...) if you want to load a config file and set some custom options,
but none of that is necessary.

@DonaldTsang

@rachmadaniHaryono what does that code do?

@rachmadaniHaryono
Contributor

a simpler api (based on the suggestion above):

#!/usr/bin/env python
import click
from flask import Flask, jsonify, request
from flask.cli import FlaskGroup

from gallery_dl import option
from gallery_dl.exception import NoExtractorError
from gallery_dl.job import DataJob


def get_json():
    data = []
    parser = option.build_parser()
    args = parser.parse_args()
    args.urls = request.args.getlist('url')
    if not args.urls:
        return jsonify({'error': 'No url(s)'})
    args.list_data = True
    for url in args.urls:
        url_res = None
        error = None
        try:
            job = DataJob(url)
            job.run()
            url_res = job.data
        except NoExtractorError as err:
            error = err
        data_item = [url, url_res, {'error': str(error) if error else None}]
        data.append(data_item)
    return jsonify({'data': data, 'urls': args.urls})


def create_app(script_info=None):
    """create app."""
    app = Flask(__name__)
    app.add_url_rule(
        '/api/json', 'gallery_dl_json', get_json)
    return app


@click.group(cls=FlaskGroup, create_app=create_app)
def cli():
    """This is a script for application."""
    pass


if __name__ == '__main__':
    cli()

@rachmadaniHaryono
Contributor

gallery_dl_gug
gug for hydrus (port 5013)

@DonaldTsang

@rachmadaniHaryono instructions on using this GUG and combining it with Hydrus? Any pre-configurations besides pip3 install gallery-dl?

@rachmadaniHaryono
Contributor

  • pip3 install flask gallery-dl (add --user if needed)
  • save the snippet above as a script (e.g. script.py)
  • run python3 script.py --port 5013
  • import the gug into hydrus

@DonaldTsang

@rachmadaniHaryono add that to the Wiki in https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts if you can; it sounds like a really good solution. Also, why port 5013? Is that port specifically used for something?

@rachmadaniHaryono
Contributor

rachmadaniHaryono commented Jan 9, 2019

Also, why port 5013? Is that port specifically used for something?

not really a technical reason. i just use it because the default port is used by another program of mine.

add that to the Wiki in CuddleBear92/Hydrus-Presets-and-Scripts if you can

i will consider it, because i'm not sure where to put it

another plan is to fork (or create a pr for) a server command, but i'm not sure if @mikf wants a pr for this

@DonaldTsang

@rachmadaniHaryono https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts/wiki
Also I would like @mikf to have a look at this, since this is pretty useful.
BTW, what is the speed overhead of using this over having a separate txt file like the one in Bionus/imgbrd-grabber#1492 ?

@rachmadaniHaryono
Contributor

BTW, what is the speed overhead of using this over having a separate txt file like the one in Bionus/imgbrd-grabber#1492 ?

this depends on hydrus vs imgbrd-grabber download speed. from my test, gallery-dl gives direct links, so hydrus doesn't have to process the links anymore.

@mikf
Owner Author

mikf commented Jan 10, 2019

another plan is to fork (or create a pr for) a server command, but i'm not sure if @mikf wants a pr for this

I've already had something similar to this in mind (implementing a (local) server infrastructure to (remotely) send commands / queries: gallery-dl --server), so I would be quite in favor of adding functionality like this.
But I'm not so happy about adding flask as a dependency, even if optional. I just generally dislike adding dependencies if they aren't absolutely necessary. I was thinking of using stuff from the http.server module in Python's standard library if possible.
Also: the script you posted here should be simplified quite a bit further. For example, there is no need to build a command-line option parser. I'll see if I can get something to work on my own.
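A dependency-free version of the endpoint could be sketched with just http.server and urllib.parse from the standard library. This is only an illustration of the idea, not the actual server branch; fetch_data is a hypothetical stand-in for running job.DataJob(url) and collecting its data:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def parse_urls(query_string):
    """Extract repeated ?url=... parameters, like Flask's request.args.getlist("url")."""
    return parse_qs(query_string).get("url", [])


def fetch_data(url):
    """Hypothetical stand-in: with gallery-dl this would run job.DataJob(url)
    and return the collected data."""
    return {"url": url}


class ApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        parsed = urlparse(self.path)
        if parsed.path != "/api/json":
            self.send_error(404)
            return
        urls = parse_urls(parsed.query)
        if not urls:
            payload = {"error": "No url(s)"}
        else:
            payload = {"data": [fetch_data(u) for u in urls], "urls": urls}
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


# To serve on the same port as the Flask scripts:
# HTTPServer(("127.0.0.1", 5013), ApiHandler).serve_forever()
```

parse_qs handles the repeated ?url= parameters the same way Flask's request.args.getlist('url') does in the scripts above.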

A few questions from me concerning Hydrus

  • The whole thing is written in Python, even version 3 since the last update. Isn't there a better way of coupling it with another Python module than an HTTP server? As in "is it possible to add a native "hook" to make it call another Python function"?
  • Is there any documentation for the request and response data formats Hydrus sends to and expects from GUGs? I've found this, but that doesn't really explain how Hydrus interacts with other things.

@rachmadaniHaryono
Contributor

But I'm not so happy about adding flask as a dependency, even if optional. I just generally dislike adding dependencies if they aren't absolutely necessary. I was thinking of using stuff from the http.server module in Python's standard library if possible.

this still depends on how big this will be. will it just be an api, or will there be an html interface for it? an existing framework will make it easier, though, and plugins for the framework will let other developers create the features they want.

of course there are better frameworks than flask, e.g. sanic or django, but i actually doubt using the standard library will be better than those.

Also: the script you posted here should be simplified quite a bit further. For example, there is no need to build a command-line option parser.

that is a modified version of the flask cli example. flask can do it more simply, but that requires setting an environment variable, which adds another command

The whole thing is written in Python, even version 3 since the last update. Isn't there a better way of coupling it with another Python module than an HTTP server? As in "is it possible to add a native "hook" to make it call another Python function"?

hydrus dev plans to make an api for this in the next milestone. there is also another hydrus user who makes an unofficial api, but he hasn't made one for downloads yet. so either wait for it or use an existing hydrus parser

Is there any documentation for the request and response data formats Hydrus sends to and expects from GUGs? I've found this, but that doesn't really explain how Hydrus interacts with other things.

hydrus expects either html or json and tries to extract data based on the parser the user made/imported. i made this one for html, but it may change in a future version: https://github.com/CuddleBear92/Hydrus-Presets-and-Scripts/blob/master/guide/create_parser_furaffinity.md

if someone wants to make one, they can try making an api similar to the 4chan api: copy the structure and use a modified parser from the existing 4chan api.

my best recommendation is to try a hydrus parser directly and see what options are there. ask the hydrus discord channel if anything is unclear

@wankio
Contributor

wankio commented Jan 11, 2019

can gallery-dl support weibo? i found this https://github.com/nondanee/weiboPicDownloader but it takes too long to scan and doesn't have the ability to skip downloaded files

@mikf
Owner Author

mikf commented Jan 13, 2019

@rachmadaniHaryono I opened a new branch for API server related stuff. The first commit there implements the same functionality as your script, but without external dependencies. Go take a look at it if you want.

And when I said your script "should be simplified ... further" I didn't mean it should use fewer lines of code, but fewer resources in terms of CPU and memory. Python might not be the right language to use when caring about things like that, but there is still no need to call functions that effectively do nothing - command-line argument parsing, for example.

@rachmadaniHaryono
Contributor

rachmadaniHaryono commented Jan 13, 2019

will it be api-only, or will there be an html interface @mikf?

e: i will comment the code on the commit

@mikf
Owner Author

mikf commented Jan 13, 2019

I don't think there should be an HTML interface directly inside of gallery-dl. I would prefer it to have a separate front-end (HTML or whatever) communicating with the API back-end that's baked into gallery-dl itself. It is a more general approach and would allow any programming language and framework to more easily interact with gallery-dl, not just Python.

@rachmadaniHaryono
Contributor

rachmadaniHaryono commented Jan 13, 2019

gallery_dl_gug

  • based on 8662e72
  • album.title is now parsed as an album tag
  • source url and download url must be a minimum of 2 characters (fixes the host:port/api/json/1 error)
  • description is not None or "none"

still on port 5013

e: related issue CuddleBear92/Hydrus-Presets-and-Scripts#69

@wankio
Contributor

wankio commented Feb 1, 2019

About the twitter extractor: requests are limited depending on how many tweets the user has, right?
if a user has over 2k+ media, 99% of the time it can't download all of their media

@mikf
Owner Author

mikf commented Feb 2, 2019

@wankio The Twitter extractor gets the same tweets you would get by visiting a timeline in your browser and scrolling down until no more tweets get dynamically loaded. I don't know how many tweets you can access like that, but Twitter's public API has a similar restriction:

https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html

This method can only return up to 3,200 of a user's most recent Tweets. Native retweets of other statuses by the user is included in this total, regardless of whether include_rts is set to false when requesting this resource.

You could try ripme. It uses the public API instead of a "hidden", browser-only API like gallery-dl. Maybe you can get more results with that.

@wankio
Contributor

wankio commented Feb 3, 2019

but if i remember correctly, ripme rips all tweets/retweets, not just the user's tweets

@schleifen

For some reason, logging in with OAuth and App Garden tokens or the -u/-p options doesn't work with flickr, which makes images that require a login to view not downloadable. But otherwise an amazing tool, thank you so much!

@wankio
Contributor

wankio commented Feb 24, 2019

today when i checked e-hentai/exhentai, it just got stuck forever. maybe my ISP is the problem, because i can't access e-hentai but exhentai is still ok. so i think OAuth should help, using cookies instead of id+password to bypass it

@ghost

ghost commented Apr 10, 2019

is there a way to download files directly into a specified folder instead of subfolders?
for example, for the pictures to be downloaded into F:\Downloaded\ i tried using
gallery-dl -d "F:\Downloaded" https://imgur.com/a/xcEl2WW
but instead they get downloaded to F:\Downloaded\imgur\xcEl2WW - Inklings
is there an argument i could add to the command to fix that?
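One way this is usually handled is via the extractor.directory config option, which holds the list of format strings used to build those subdirectories; setting it to an empty list should place files directly into the base directory (a sketch, assuming your version supports it; backslashes would need escaping in JSON, hence the forward slashes):

```json
{
    "extractor": {
        "base-directory": "F:/Downloaded",
        "directory": []
    }
}
```

The imgur/xcEl2WW - Inklings part comes from imgur's default directory format strings, which the empty list overrides.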

@mikf
Owner Author

mikf commented Jan 26, 2024

@tddschn You can manually specify the browser profile folder path when the defaults don't work.

--cookies-from-browser "chrome:C:\path\to\profile"

@github-userx

github-userx commented Feb 1, 2024 via email

@github-userx

github-userx commented Feb 1, 2024 via email

@AdamSaketume26

could someone please help me with making instagram download stories, reels, and posts to their own files using the command line?

@AdamSaketume26

i notice that even when gallery-dl says i am rate-limited, i can still use the twitter website, go to the user's profile, and view their tweets if i click the replies or media tab. but when i change the timeline strategy to media and change twitter include to "media", it doesn't work and i am still rate-limited. why is there a rate limit for gallery-dl but not on the website's media tab?

@biggestsonicfan

biggestsonicfan commented Feb 4, 2024

Can you pass the cookies that gallery-dl is currently using to a post-processor?

bradenhilton pushed a commit to bradenhilton/gallery-dl that referenced this issue Feb 5, 2024
@WhyEssEff

WhyEssEff commented Feb 6, 2024

Could we have a config option to sleep the extractor for a set amount of time upon encountering a 429 Too Many Requests error, and to retry with the base delay before it goes into the delay-interval-increase routine?

I'm wondering, for larger image repositories (my use case here is downloading DeviantArt collections), whether sleeping for five/ten minutes and then continuing as normal might be faster than getting stuck in 17s-delay purgatory. It's effectively what I do now: when it gets too egregious, I interrupt the process and try again 10 minutes later, and it seems to work as usual on restart; I just want to be able to continue where I left off.

@WhyEssEff

WhyEssEff commented Feb 6, 2024

@biggestsonicfan I'd prefer the behavior I'm trying to get at to happen specifically on encountering an error. I'd like to assume minimum request time when possible, while telling the extractor to halt for x seconds if it throws back a 429, in order to see if it can just restart comfortably on the default delay after just not doing anything for x amount of time.

e.g., assume 0.5 second sleep-request until 429 is thrown, pause the extractor for 120 seconds, retry with default delay, then if it's still throwing 429s assume current behavior of increasing delay interval by 1s and trying again until it works

what this could look like would be something akin to the following:
[warning] API responded with 429 Too Many Requests. Pausing extractor for {configured-amount}s.

and then it could retry with default delay, upon which if it still fails it increases delay interval.

I'm wondering this because the longer delays sort of rack up runtime cumulatively and it might be more optimal for larger galleries to have this option, even if you have to set it to like 5/10 minutes to use it effectively.
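The pause-then-retry idea could be sketched like this (hypothetical names throughout; this is not gallery-dl's actual retry code):

```python
import time


class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response."""


def fetch_with_pause(fetch, pause=120, base_delay=0.5, max_retries=5):
    """Try fetch() after a short base delay; on the first 429, pause once for
    `pause` seconds, then fall back to the existing behavior of growing the
    delay by 1s per retry."""
    try:
        time.sleep(base_delay)
        return fetch()
    except TooManyRequests:
        time.sleep(pause)  # the single long pause being proposed
    delay = base_delay
    for _ in range(max_retries):
        try:
            time.sleep(delay)
            return fetch()
        except TooManyRequests:
            delay += 1.0  # current routine: increase the delay interval by 1s
    raise TooManyRequests("still rate-limited after backoff")
```

Making pause configurable is the config option being requested; everything after the first long pause mirrors the existing increase-by-1s routine.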

@throwaway242685

throwaway242685 commented Feb 9, 2024

hi, is there a way to download all the saved posts on my Instagram account with gallery-dl?

like, all of them at once.

@Noonereallycomeon

Noonereallycomeon commented Feb 15, 2024

@useless642
you can get the link from Instagram on web browser it's "https://www.instagram.com/{username}/saved/all-posts/"
there's also separate links for each collection if you want to download them separately.

@mikf
Owner Author

mikf commented Feb 16, 2024

@github-userx gofile will be fixed in the next release: 3433481

@biggestsonicfan Enable metadata-extractor and use the cookies property of the inserted object. cookies is a CookieJar object though.
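Since cookies is a CookieJar rather than a plain mapping, a post-processor will usually want to flatten it first. A stdlib-only sketch (the jar built here just simulates what the extractor object would expose):

```python
from http.cookiejar import Cookie, CookieJar


def jar_to_dict(jar):
    """Flatten a CookieJar into {name: value} for easy templating/serialization."""
    return {cookie.name: cookie.value for cookie in jar}


# Simulate a jar like the one the metadata-extractor object would expose.
jar = CookieJar()
jar.set_cookie(Cookie(
    version=0, name="session", value="abc123",
    port=None, port_specified=False,
    domain=".example.org", domain_specified=True, domain_initial_dot=True,
    path="/", path_specified=True,
    secure=True, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
))

print(jar_to_dict(jar))  # {'session': 'abc123'}
```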

@taskhawk

How often are these QFS issues rotated? This one is getting kinda long.

@throwaway242685

throwaway242685 commented Feb 22, 2024

hi, does gallery-dl have a parameter to update it to the master on GitHub?

yt-dlp has the following command:

yt-dlp --update-to master

does gallery-dl have something similar? if not, could this be implemented? 🙏

@fireattack
Contributor

You can just run python3 -m pip install -U -I --no-deps --no-cache-dir https://github.com/mikf/gallery-dl/archive/master.tar.gz

@BakedCookie

metadata.event has both file and skip. Is there a way to combine them? I'd like to save metadata on file download, but I'd also like to update the metadata for any skipped files. This is what my extractor config looks like:

"extractor": {
        "base-directory": "./",
        
        "skip": "abort:100",
        
        "postprocessors": [
            {
                "name": "metadata",
                "mode": "json",
                "event": "skip",
                "directory": ".meta"
            }
        ],

@mikf
Owner Author

mikf commented Feb 28, 2024

@BakedCookie
"event": ["file", "skip"] or "event": "file,skip"

(This isn't properly documented for some reason, while other options with similar semantics like include are.)
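Applied to the config snippet from the question above, the post-processor entry becomes:

```json
"postprocessors": [
    {
        "name": "metadata",
        "mode": "json",
        "event": ["file", "skip"],
        "directory": ".meta"
    }
]
```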

@biggestsonicfan

Can gallery-dl cache cookies grabbed from the browser for a duration? I'm noticing startup takes a while per use, whereas if I use cookies from a file, it's instant.

@mikf
Owner Author

mikf commented Feb 29, 2024

@biggestsonicfan
It keeps them in memory while it is running.

To improve startup time, you could use BROWSER/.DOMAIN (e.g. firefox/.instagram) as --cookies-from-browser argument to only extract cookies for that domain instead of all; or you write them to a cookies.txt file with something like

gallery-dl --cookies-from-browser firefox --cookies-export cookies.txt --no-download http://example.org/a.jpg

and then load them from there.

@JinEnMok

I noticed that per #80 there was some talk about Collections, but they still aren't implemented. They probably aren't that different from albums (see e.g. https://www.artstation.com/gallifreyan/collections/197428), so probably (?) wouldn't be that hard to implement.

Should I open an issue for this?

@biggestsonicfan

To improve startup time, you could use BROWSER/.DOMAIN (e.g. firefox/.instagram) as --cookies-from-browser argument to only extract cookies for that domain instead of all; or you write them to a cookies.txt file with something like

I used to do cookies.txt on a per-site basis but it got a little tedious to manage. I already do

"cookies": ["firefox", "/firefox/profile/path", null, null, ".deviantart.com"],

if that's what you meant. I will try something like "cookies": ["firefox/.patreon", "/firefox/profile/path", null, null, ".patreon.com"], though to test.

Ah: patreon: cookies: Unsupported browser 'firefox/.patreon'

Moving on from that, though: I would like to contribute support for a new site to gallery-dl, but other than browsing the existing code, I don't really see any templates for an extractor and a test suite? Where would I start for a site that has embedded JSON in its HTML page?

@mikf
Owner Author

mikf commented Mar 1, 2024

@JinEnMok
Will be supported in the next batch of commits I git push to GitHub.
edit: cf9e99c

@biggestsonicfan
Yeah, that's what I was referring to. There isn't much you can do then, except the cookies.txt file thing. Configurable caching behavior is not implemented yet. (It will be, eventually.)

I don't really see any templates for both an extractor and a test suite

There isn't any. You could take a look at merged PRs that add support for a new site.

Where would I start for a site that has embedded JSON in it's html page of the page's contents?

text.extr(…) the JSON part and util.json_loads(…) it. (example)
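The pattern amounts to slicing the JSON blob out of the page and parsing it. A self-contained sketch using only the standard library (extr below is a simplified stand-in for gallery_dl.text.extr, and the HTML is made up for illustration):

```python
import json


def extr(txt, begin, end, default=""):
    """Return the substring of txt between begin and end
    (simplified stand-in for gallery_dl.text.extr)."""
    try:
        first = txt.index(begin) + len(begin)
        return txt[first:txt.index(end, first)]
    except ValueError:
        return default


html = """<html><body>
<script id="state" type="application/json">{"gallery": {"id": 42, "files": ["a.jpg", "b.jpg"]}}</script>
</body></html>"""

# Slice out the embedded JSON, then parse it.
blob = extr(html, '<script id="state" type="application/json">', '</script>')
state = json.loads(blob)
print(state["gallery"]["files"])  # ['a.jpg', 'b.jpg']
```

In a real extractor you would use text.extr and util.json_loads instead, as in the example linked above.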

@JinEnMok

JinEnMok commented Mar 1, 2024

@mikf you're the best, cheers! :)

@mikf
Owner Author

mikf commented Mar 1, 2024

Closing this as suggested by taskhawk (#146 (comment)).
New issue: #5262.

@mikf mikf closed this as completed Mar 1, 2024
@mikf mikf unpinned this issue Mar 1, 2024