Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add cookie support for all rippers #1483

Merged
merged 5 commits into from
Dec 3, 2019
Merged

Conversation

xarantolus
Copy link
Contributor

@xarantolus xarantolus commented Nov 10, 2019

Category

This change is exactly one of the following:

  • a bug fix (Fix #...)
  • a new Ripper
  • a refactoring
  • a style change/fix
  • a new feature

Description

Many resources could be downloaded by ripme, but are hidden behind a login. Users that have accounts with these services can access the content, but ripme can't.

Since most websites use cookies to store the logged in state in the browser, we can use them to log in. We need to use the cookies the user has in the browser and send them with our requests.

How to use

(this should also be a wiki entry if accepted)

To use the cookies feature, the cookies need to be supplied in the config file, in a line like this: cookies.host = key1=val1; key2=val2.

To get them, go to the website for which you want to get cookies, e.g. https://reddit.com.

Open the browser console (e.g. using F12; right click & inspect element, then going to Console) and type in / paste:

console.log(document.cookie)

Add the output after your config entry, like this:

cookies.reddit.com = key1=val1; key2=val2

The whole config entry can also be generated using this snippet in the console:

console.log("cookies." + window.location.hostname.replace(/^www\./, "") + " = " + document.cookie)

Example:

cookies.reddit.com = reddit_session=<value>; other_cookie=<value>

This matches all subdomains for reddit.com, such as www.reddit.com, old.reddit.com, new.reddit.com, np.reddit.com etc.

Examples

The old ripme version cannot rip quarantined or private subreddits.
Take for example this url: https://www.reddit.com/r/waterniggas/.
After setting cookies.reddit.com, Ripme is able to fetch the content and correctly downloads all images.

Testing

Required verification:

  • I've verified that there are no regressions in mvn test (there are no new failures or errors).
  • I've verified that this change works as intended.
    • Downloads all relevant content.
    • Downloads content from multiple pages (as necessary or appropriate).
    • Saves content at reasonable file names (e.g. page titles or content IDs) to help easily browse downloaded content.
  • I've verified that this change did not break existing functionality (especially in the Ripper I modified).

Optional but recommended:

  • I've added a unit test to cover my change.

This closes #1245 and has to do with (but doesn't necessarily close) #1273, #1438, #1333, #961 (Setting cookies for twitter.com did not work, still getting 401 error).

TODO

There are two problems that might occur:

  1. A ripper breaks because cookies (that are already sent with a request) are overwritten (should not happen, see later)
  2. Sites change their layout when user is logged in and ripper breaks

While this change probably doesn't overwrite cookies that are already added to the connection (since connection.cookies() "Adds each of the supplied cookies to the request"; doesn't talk about overwriting), more manual testing needs to be done, especially with the following rippers (since they already use cookies):

  • ZizkiRipper
  • XcartxRipper
  • WebtoonsRipper
  • VscoRipper
  • TwodgalleriesRipper
  • TsuminoRipper
  • ThechiveRipper
  • StaRipper
  • SankakuComplexRipper
  • PhotobucketRipper
  • PahealRipper
  • ModelmayhemRipper
  • InstagramRipper
  • HentaifoundryRipper
  • AerisdiesRipper
  • CheveretoRipper
  • DeviantartRipper
  • EHentaiRipper
  • EightmusesRipper
  • FuraffinityRipper
  • FuskatorRipper

You can use the following jar I built to test these changes:

ripme-cookies-prerelease-jar-with-dependencies.zip

Another change that might be possible is adding an error message when the status code is 401 or 403 that tells the user about this feature.
The current error message looks like this: Failed to load https://www.reddit.com/r/waterniggas.json after 1 attempts, so this might need some work.

@cyian-1756
Copy link
Collaborator

This looks great, I'll get on testing it

While this doesn't make any difference, not using the argument is kind of bad.
One could also remove the argument and use `this.url` directly.
@cyian-1756
Copy link
Collaborator

After a bit of testing this looks good to me, but would you mind adding a logging statement somewhere that cookies are being added to the request?

@xarantolus
Copy link
Contributor Author

Sure, that makes sense.

I wasn't sure about which log level to use, so I chose info. If it should be another level or there are any other changes that could be made please feel free to point that out :)

@cyian-1756
Copy link
Collaborator

LGTM

@cyian-1756 cyian-1756 merged commit f9f0372 into RipMeApp:master Dec 3, 2019
prn39p pushed a commit to prn39p/ripme_old that referenced this pull request Dec 5, 2019
Add cookie support for all rippers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Quarantined subreddits cannot be downloaded
2 participants