Question: Reddit - Downloading Imgur/Gfycat/Redgif hosted posts to the specific subreddit or user directory #1364

Closed
sourmilk01 opened this issue Mar 8, 2021 · 14 comments

Comments

@sourmilk01

I'm having trouble setting up my config to save Reddit posts the way I'd like: all posts within a subreddit or user page should save to the directory for that subreddit or user. This works fine for reddit-hosted images, but the extractor saves imgur-, gfycat-, and redgif-hosted images and MP4s to their own separate directories outside of the reddit directory.

What I'm trying to figure out is how to have all the files associated with the queued subreddit or user save to the respective folder inside the reddit directory, regardless of whether they're hosted on reddit/imgur/redgif/etc., and for the file names to use the FileName convention I have set up for the reddit extractor. Can anyone help me with setting this up in the config?

@sourmilk01
Author

And with regard to scraping user page posts: would it be possible to have the posts on the user page, regardless of the subreddit, saved to the same queued user directory?

@Hrxn
Contributor

Hrxn commented Mar 8, 2021

You should try to use the category-transfer option set for the Reddit extractor, like here.

For user specific extraction, use the subcategories of the Reddit extractor, in this case subreddit and user, like this:

Adapted from the example config file in /docs:

"reddit":
        {
            "subreddit": {
                "directory":  ["{category}", "MySubreddits", "{subreddit}"],
                "filename": "{id}{num:? //>02} {title[:220]}.{extension}"
            },
            "user": {
                "directory":  ["{category}", "MyUserfollows", "{subreddit}"],
                "filename": "{id}{num:? //>02} {title[:220]}.{extension}"
            }, 
            "comments": 0,
            "morecomments": false,
            "date-min": 0,
            "date-max": 253402210800,
            "date-format": "%Y-%m-%dT%H:%M:%S",
            "id-min": "0",
            "id-max": "zik0zj",
            "recursion": 0,
            "videos": true,
            "user-agent": "Python:gallery-dl:0.8.4 (by /u/mikf1)",
            "category-transfer": true
        },

Filename settings are left as in the example. I'm not entirely sure {subreddit} is the correct field name for the user name, though it might be. I can't test this right now; you can check for yourself with gallery-dl -K <URL>.

@mikf
Owner

mikf commented Mar 8, 2021

category-transfer

That probably won't work here.

Try enabling parent-directory to have all imgur, gfycat, and redgif files be put inside the Reddit directory.

@sourmilk01
Author

@mikf

How would I set up parent-directory in the config, would it be true just inside the Reddit extractor section or imgur/gfycat/redgif too?

Just in case anyone is familiar, I am trying to set up the scraper so it performs similarly to the Rip-Me application (https://github.com/RipMeApp/ripme), where a queued subreddit or user gallery pulls all of the images or mp4's to a singular folder.

@mikf
Owner

mikf commented Mar 8, 2021

How would I set up parent-directory in the config, would it be true just inside the Reddit extractor section or imgur/gfycat/redgif too?

Yep, just like "category-transfer": true in the example above, only with a different name.

to a singular folder.

You might have to change the directory settings for imgur etc. to an empty list then ("directory": []), because those'll get their own subdirectory otherwise: ./reddit/SUBREDDIT/imgur/filename.ext, for example.
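
A minimal sketch of how this might look in the config file, assuming the usual layout where the site sections sit under a top-level "extractor" key. The reddit "directory" value here is only illustrative; the imgur, gfycat, and redgifs sections use the empty list suggested above so that they add no subdirectory of their own.

```json
{
    "extractor": {
        "reddit": {
            "parent-directory": true,
            "directory": ["reddit", "{subreddit}"]
        },
        "imgur": {
            "directory": []
        },
        "gfycat": {
            "directory": []
        },
        "redgifs": {
            "directory": []
        }
    }
}
```

With settings along these lines, files from all three hosts should land directly in ./reddit/SUBREDDIT/ instead of in a per-host subdirectory.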

@sourmilk01
Author

sourmilk01 commented Mar 8, 2021

I believe it is working as intended now. I think one of my issues was having a directory for subreddit and user in the same config; right now I have a separate config for each that I'll switch out when needed.

        "reddit":
        {
            "parent-directory": true,
            "directory":  ["reddit", "_u_{author}"],
            "filename": "{subreddit}_{author}_{title}_{id}_{num}_{filename}_{date}.{extension}",
            "comments": 0,
            "morecomments": false,
            "date-min": 0,
            "date-max": 253402210800,
            "date-format": "%Y-%m-%dT%H:%M:%S",
            "id-min": "0",
            "id-max": "zik0zj",
            "recursion": 0,
            "videos": true,
            "user-agent": "Python:gallery-dl:0.8.4 (by /u/mikf1)",
            "postprocessors": [{
                "name": "metadata",
                "mode": "json",
                "directory"       : "Metadata",
                "extension-format": "metadata.txt"
            }]
        },

My last issue or question is if there is any way to force the imgur/gfycat/redgif posts on a subreddit or user gallery to use the FileName convention of their host reddit post. Right now it is spitting the files out with the default "imgur_23aj435_title" filename, but I would prefer if it would retain the information from the reddit post, like with reddit-hosted images.

@Hrxn
Contributor

Hrxn commented Mar 8, 2021

Not sure if switching out configs is really necessary, but if it works for you, why not.

The filename setting for Imgur/Gfycat/Redgifs etc is the same as when using one of these sites itself, you can therefore customize them accordingly in their respective sections in the config file. In terms of retaining information though, you only have the metadata available as provided by the hosting site, obviously.

@sourmilk01
Author

sourmilk01 commented Mar 8, 2021

Not sure if switching out configs is really necessary, but if it works for you, why not.

Is there a way to set up the config so the directory is conditional depending on whether it's a user gallery or subreddit?

Regarding the filename, at the time being I have the Imgur/Gfycat/Redgif using my Reddit filename convention, just for conformity's sake. The issue is, like you said, the metadata for those sites is separate and different from reddit and the filename will have pieces missing if trying to use keywords from reddit.

I guess what I'm asking is if it is possible to have reddit posts whose media is hosted on Imgur/Gfycat/Redgif have the scraper pull metadata from the reddit post itself rather than from the host website.

For example, this post, whose media is hosted on imgur, would have the metadata (and thus keywords for filenames) scraped from the reddit post rather than the actual imgur link where the image is (https://i.imgur.com/ZSTidlZ.jpg).

@sourmilk01
Author

@mikf , I'm guessing setting up the scraper to name the files how I described is currently not possible.

Could you tag this thread with "feature-request"?

mikf added a commit that referenced this issue Mar 12, 2021
experimental, might not work as expected, etc.
@mikf
Owner

mikf commented Mar 12, 2021

@sourmilk01 could you try the parent-metadata option from df94182? It overwrites the metadata dict of child extractors (imgur, gfycat, etc.) with data from the parent (reddit) and should allow you to use the same filename format string for all of them.

edit: there are open issues with a somewhat similar problem as this one, by the way: #637, #827

@sourmilk01
Author

@mikf, thank you for the commit, I didn't see it. I've tested the parent-metadata option and it appears to be working exactly as intended/requested. As for the other open issues, I suppose this change fixes them; I often can't think of the right terms or words when searching for other issues so I might have created an extra/unnecessary thread, apologies.

Thanks again for this great tool and for updating it so often. You rock! I was going to ask if there was a way we could donate or 'tip' in any way, but I saw #347 . Otherwise, I'd offer to help in any other way but I have no background in coding :/

@sourmilk01
Author

sourmilk01 commented Mar 13, 2021

Oh, in case anyone else is reading this and trying to set up parent-metadata, your FileName option for imgur/gfycat/redgif must be set to be the same as your reddit FileName option; otherwise it will just be using the default reddit FileName.
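
As a rough sketch of that setup, assuming the site sections sit under the usual top-level "extractor" key and reusing the filename format posted earlier in this thread, the config might look something like this. The "directory" values are illustrative and not the only workable choice.

```json
{
    "extractor": {
        "reddit": {
            "parent-directory": true,
            "parent-metadata": true,
            "directory": ["reddit", "{subreddit}"],
            "filename": "{subreddit}_{author}_{title}_{id}_{num}_{filename}_{date}.{extension}"
        },
        "imgur": {
            "directory": [],
            "filename": "{subreddit}_{author}_{title}_{id}_{num}_{filename}_{date}.{extension}"
        },
        "gfycat": {
            "directory": [],
            "filename": "{subreddit}_{author}_{title}_{id}_{num}_{filename}_{date}.{extension}"
        },
        "redgifs": {
            "directory": [],
            "filename": "{subreddit}_{author}_{title}_{id}_{num}_{filename}_{date}.{extension}"
        }
    }
}
```

The point of repeating the same "filename" string in every section is that each host extractor formats its own file names; parent-metadata only makes the reddit fields available to them.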

@sourmilk01
Author

@mikf , I may have found a potential 'gap' in posts being scraped using parent-metadata, albeit very minor.

The option does not appear to apply to reddit posts whose original gfycat host now redirects to redgifs via the gifdeliverynetwork domain.

Example subreddit and post I was scraping (NSFW):
https://www.reddit.com/r/hopelesssofrantic/
https://www.reddit.com/r/hopelesssofrantic/comments/drbgeg/the_yoga_abs_are_coming_in/
https://www.gfycat.com/dirtydearestgenet
https://www.gifdeliverynetwork.com/dirtydearestgenet

I already have a reddit-metadata FileName set up in the config for gfycat and redgif, and I even tried setting up a section for "gifdeliverynetwork" using the same format, but files hosted this way are being downloaded with empty metadata fields ("None"). The posts are still being saved to the directory, so parent-directory does not appear to be affected.

Again, very minor and it only appears to apply for very few posts, but I thought I'd let you know.

mikf added a commit that referenced this issue Mar 14, 2021
Allow forwarding metadata from the top-level extractor to all children
if 'parent-directory' is enabled for all extractors along the way.

For example 'reddit' -> 'gfycat' -> 'redgifs'
@mikf
Owner

mikf commented Mar 14, 2021

@sourmilk01 should be fixed in 2364174 as long as you enable parent-metadata for all relevant extractors (reddit and gfycat in this case), or globally:

$ gallery-dl -o filename="{subreddit}_{author}_{title}_{id}_{num}_{filename}_{date}.{extension}" -o directory= -o parent-directory=1 -o parent-metadata=1 https://www.reddit.com/r/hopelesssofrantic/comments/drbgeg/the_yoga_abs_are_coming_in/
/tmp/hopelesssofrantic_hopelesssofrant…in_drbgeg_0_None_2019-11-04 02:59:45.mp4

Also, it seems that gfycat/redgifs doesn't provide a filename. You could change {filename} to {filename|gfyName} to have an alternative for those sites.
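
One way to express the "globally" variant in a config file rather than on the command line is to place the options directly under "extractor", where they apply to every extractor unless a site section overrides them. This is only a sketch under that assumption; it reuses the filename format from the command above with the {filename|gfyName} alternative swapped in.

```json
{
    "extractor": {
        "parent-directory": true,
        "parent-metadata": true,
        "filename": "{subreddit}_{author}_{title}_{id}_{num}_{filename|gfyName}_{date}.{extension}"
    }
}
```

Note that a filename set at this level would also apply to downloads that do not start from reddit, where fields such as {subreddit} would be missing, so per-site sections may be the safer choice.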

I even tried setting up a section for "gifdeliverynetwork"

gifdeliverynetwork is covered by redgifs extractors.

thank you for the commit, I didn't see it.

I usually don't immediately push commits to GitHub. In this case it only got pushed a minute or so before I left that comment, so there was no way for you to see it before that.

As for the other open issues

I mentioned them only in case something posted in them would be helpful here, and because the parent-metadata option might be "related" to those problems in some way.
