Add cookie support for all rippers #1483
Merged
Category
This change is exactly one of the following:
Description
Many resources that ripme could download are hidden behind a login. Users who have accounts with these services can access the content in their browser, but ripme can't.
Since most websites use cookies to store the logged-in state in the browser, ripme can use them to log in: it takes the cookies the user has in the browser and sends them with its requests.
How to use
(this should also be a wiki entry if accepted)
To use the cookies feature, the cookies need to be supplied in the config file, in a line like this:
cookies.host = key1=val1; key2=val2
To get them, go to the website for which you want to get cookies, e.g. https://reddit.com.
Open the browser console (e.g. press F12, or right-click, choose Inspect Element, then switch to the Console tab) and type in or paste:
Add the output after your config entry, like this:
The whole config entry can also be generated using this snippet in the console:
Example:
cookies.reddit.com = reddit_session=<value>; other_cookie=<value>
This matches all subdomains for reddit.com, such as www.reddit.com, old.reddit.com, new.reddit.com, np.reddit.com etc.
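The config format and the subdomain matching described above can be sketched in plain Java. This is a minimal illustration, not the code in this PR: the class `CookieConfigSketch` and the methods `parseCookies` and `hostMatches` are hypothetical names, and the suffix-based matching rule is an assumption based on the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not part of ripme.
class CookieConfigSketch {

    // Parse a config value such as "key1=val1; key2=val2" into a name -> value map.
    static Map<String, String> parseCookies(String configValue) {
        Map<String, String> cookies = new LinkedHashMap<>();
        for (String pair : configValue.split(";")) {
            String trimmed = pair.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue; // skip empty or malformed entries
            }
            // Split on the first '=' only, because cookie values may contain '='.
            cookies.put(trimmed.substring(0, eq).trim(), trimmed.substring(eq + 1).trim());
        }
        return cookies;
    }

    // Assumed rule: a config entry for "reddit.com" applies to reddit.com
    // itself and to any host ending in ".reddit.com".
    static boolean hostMatches(String configuredHost, String requestHost) {
        String configured = configuredHost.toLowerCase();
        String request = requestHost.toLowerCase();
        return request.equals(configured) || request.endsWith("." + configured);
    }

    public static void main(String[] args) {
        System.out.println(parseCookies("key1=val1; key2=val2"));
        System.out.println(hostMatches("reddit.com", "old.reddit.com"));
        System.out.println(hostMatches("reddit.com", "notreddit.com"));
    }
}
```

Requiring the leading dot in the suffix check keeps an unrelated host such as notreddit.com from receiving the cookies.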
Examples
The old ripme version cannot rip quarantined or private subreddits.
Take for example this url:
https://www.reddit.com/r/waterniggas/
After setting cookies.reddit.com, ripme is able to fetch the content and correctly downloads all images.
Testing
Required verification:
mvn test (there are no new failures or errors).
Optional but recommended:
This closes #1245 and has to do with (but doesn't necessarily close) #1273, #1438, #1333, #961 (Setting cookies for twitter.com did not work, still getting 401 error).
TODO
There are two problems that might occur:
While this change probably doesn't overwrite cookies that are already added to the connection (since connection.cookies() "Adds each of the supplied cookies to the request"; the documentation doesn't talk about overwriting), more manual testing needs to be done, especially with the following rippers (since they already use cookies):
You can use the following jar I built to test these changes:
ripme-cookies-prerelease-jar-with-dependencies.zip
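One way to sidestep the overwriting question, whatever connection.cookies() does with duplicate names, would be to merge the two cookie sets before passing them to the connection. The sketch below assumes both sets are plain maps and that the ripper's own cookies should win. `CookieMergeSketch` and `mergeKeepingExisting` are hypothetical names, and this is not what the PR currently does.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; names are illustrative and not part of ripme.
class CookieMergeSketch {

    // Returns a new map holding every ripper cookie, plus those config
    // cookies whose names the ripper has not already set.
    static Map<String, String> mergeKeepingExisting(Map<String, String> ripperCookies,
                                                    Map<String, String> configCookies) {
        Map<String, String> merged = new LinkedHashMap<>(ripperCookies);
        for (Map.Entry<String, String> entry : configCookies.entrySet()) {
            merged.putIfAbsent(entry.getKey(), entry.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> ripper = new LinkedHashMap<>();
        ripper.put("session", "set-by-ripper");
        Map<String, String> config = new LinkedHashMap<>();
        config.put("session", "from-config");
        config.put("over18", "1");
        System.out.println(mergeKeepingExisting(ripper, config));
    }
}
```

If the user's config cookies should take precedence instead, the roles of the two maps can simply be swapped.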
Another change that might be possible is adding an error message when the status code is 401 or 403 that tells the user about this feature.
The current error message says nothing about authentication, so this might need some work. It looks like this:
Failed to load https://www.reddit.com/r/waterniggas.json after 1 attempts
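A possible shape for such a message, as a rough sketch only: the class `AuthHintSketch`, the method `failureMessage`, and the hint wording are all hypothetical, and how the status code would reach this point in ripme is not addressed here.

```java
// Hypothetical sketch; names and wording are illustrative and not part of ripme.
class AuthHintSketch {

    // Builds the existing "Failed to load" message and, for 401/403,
    // appends a hint that points the user at the cookies config entry.
    static String failureMessage(String url, int attempts, int statusCode) {
        String message = "Failed to load " + url + " after " + attempts + " attempts";
        if (statusCode == 401 || statusCode == 403) {
            message += " (HTTP " + statusCode + "): the content may require a login;"
                    + " try setting cookies.<host> in the config file";
        }
        return message;
    }

    public static void main(String[] args) {
        System.out.println(failureMessage("https://example.com/page.json", 1, 403));
        System.out.println(failureMessage("https://example.com/page.json", 1, 500));
    }
}
```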