Add image scraping support #370

Merged Mar 11, 2020 (12 commits)

Conversation

WithoutPants
Collaborator

Resolves #344

Adds the ability to scrape performer images and scene cover images.

This change also introduces the subScraper xpath post-processing option. If subScraper appears in an attribute's xpath configuration, the sub-scraper runs after all other post-processes are complete. It takes the resulting value, treats it as a URL, and performs an HTTP request for it. The subScraper config holds a nested scraping configuration, which is applied to the fetched page. This lets you traverse to other web pages to reach the attribute value you are after.

For example, from the Boobpedia scraper config in #333:

...
performerScraper:
  performer:
    # ..snip..
    Image:
      selector: //table[@class="infobox"]//tr[2]//a/@href
      # URL is a partial url, add the first part
      replace:
        - regex: ^
          with: http://www.boobpedia.com
      subScraper:
        selector: //div[@class="fullImageLink"]/a/@href
        replace:
          - regex: ^
            with: http://www.boobpedia.com

This fragment gets the URL from the xpath //table[@class="infobox"]//tr[2]//a/@href and adds the http://www.boobpedia.com prefix with the replace post-process. The sub-scraper post-process then runs: it requests the document at the resulting URL, gets the URL from //div[@class="fullImageLink"]/a/@href on that page, and applies its own replace post-process.
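The order of operations described here (selector, then replace, then subScraper last) can be sketched roughly as follows. This is a minimal illustration and not the code from this pull request: the names `scrape_attribute`, `select`, `fetch` and `PAGES` are hypothetical, the pages are in-memory stand-ins on an assumed `example.test` host rather than real HTTP responses, and `select` only handles the small XPath subset that Python's standard library supports.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical in-memory "site": URL -> well-formed markup.
# A real scraper would perform an HTTP request instead.
PAGES = {
    "http://example.test/wiki/Performer": (
        '<html><body><table class="infobox">'
        "<tr><td>Name</td></tr>"
        '<tr><td><a href="/wiki/File:Photo.jpg">photo</a></td></tr>'
        "</table></body></html>"
    ),
    "http://example.test/wiki/File:Photo.jpg": (
        '<html><body><div class="fullImageLink">'
        '<a href="/images/Photo.jpg">full size</a>'
        "</div></body></html>"
    ),
}


def fetch(url):
    """Stand-in for an HTTP GET (assumption: no network access here)."""
    return PAGES[url]


def select(markup, xpath):
    """Evaluate a small XPath subset that ends in /@attribute."""
    path, attr = xpath.rsplit("/@", 1)
    node = ET.fromstring(markup).find("." + path)
    return node.get(attr) if node is not None else None


def apply_replace(value, rules):
    """Apply each regex/with rule in order."""
    for rule in rules:
        value = re.sub(rule["regex"], rule["with"], value)
    return value


def scrape_attribute(markup, config):
    """Run selector, then replace, then the sub-scraper (always last)."""
    value = select(markup, config["selector"])
    if value is None:
        return None
    value = apply_replace(value, config.get("replace", []))
    sub = config.get("subScraper")
    if sub is not None:
        # The processed value is treated as a URL; the nested config is
        # applied to the page found there.
        value = scrape_attribute(fetch(value), sub)
    return value


# Mirrors the shape of the YAML fragment, with a placeholder host.
IMAGE_CONFIG = {
    "selector": '//table[@class="infobox"]//tr[2]//a/@href',
    "replace": [{"regex": "^", "with": "http://example.test"}],
    "subScraper": {
        "selector": '//div[@class="fullImageLink"]/a/@href',
        "replace": [{"regex": "^", "with": "http://example.test"}],
    },
}

image_url = scrape_attribute(fetch("http://example.test/wiki/Performer"), IMAGE_CONFIG)
```

Because the nested configuration is handled by the same function, the sketch would also follow a subScraper placed inside another subScraper; whether the actual implementation allows that nesting is not stated here.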

The Image value is expected to be a URL, which the system will subsequently request and encode.
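As a rough sketch of that final "request and encode" step: the description does not say which encoding is used, so the base64 data URI below is purely an assumption for illustration. `to_data_uri`, `scrape_image` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in that returns fixed bytes instead of making a real request.

```python
import base64


def to_data_uri(image_bytes, content_type):
    """Encode raw bytes as a base64 data URI (assumed encoding, see above)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{content_type};base64,{encoded}"


def scrape_image(url, fetch):
    """Fetch the image at `url` and return it encoded.

    `fetch` is a hypothetical callable returning (bytes, content_type);
    a real implementation would perform an HTTP GET here.
    """
    image_bytes, content_type = fetch(url)
    return to_data_uri(image_bytes, content_type)


def fake_fetch(url):
    """Stand-in fetcher so the sketch runs without network access."""
    return b"abc", "image/png"


encoded_image = scrape_image("http://example.test/images/Photo.jpg", fake_fetch)
```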

Also adds image scraping to the stash scraper.

@WithoutPants added the feature label Feb 13, 2020
pkg/scraper/image.go (outdated, resolved)
@bnkai
Collaborator

bnkai commented Feb 13, 2020

Tests ok with me.
Images are fetched ok and saved with the boobpedia demo scraper.

@WithoutPants marked this pull request as ready for review March 2, 2020 23:34
@WithoutPants requested a review from bnkai March 2, 2020 23:46
@WithoutPants added this to the Version 0.2.0 milestone Mar 3, 2020
@WithoutPants
Collaborator Author

Rebased and ported UI changes to 2.5. @bnkai can you please review on v2 and v2.5 of the UI?

@MrX292
Contributor

MrX292 commented Mar 10, 2020

stashapp/CommunityScrapers#2 some scrapers for it

Collaborator

@bnkai left a comment

Tested against the v2 and v2.5 UI using the boobpedia scraper and @MrX292's mofos and newfreeones xpath scrapers.
Both scene and performer images seem to work fine.

@WithoutPants merged commit 34d8293 into stashapp:develop Mar 11, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
* Add sub-scraper functionality
* Add scraping of performer image
* Add scene cover image scraping
* Port UI changes to v2.5
* Fix v2.5 dialog suggest color
* Don't convert eol of UI to support pretty
@WithoutPants deleted the scrape_image branch February 4, 2021 03:00
Labels
feature Pull requests that add a new feature
Development

Successfully merging this pull request may close these issues.

[Feature] Add image support for scrapers
3 participants