Add Xpath post processing and performer name query #333

Merged: 6 commits into stashapp:develop, Jan 31, 2020

Conversation

WithoutPants (Collaborator)

Note: by necessity, this PR supersedes and extends #332. I can close the other PR, or just rebase this one when it is merged (ideally the latter).

This change adds some post-processing functionality to the xpath scraping configuration.

In an xpath scraper, a field value may now be either a string xpath selector value, or a sub-object.

If it is a sub-object, it must contain the selector field, which has the xpath selector value. Within the sub-object, further fields are available to perform post-processing:

- concat: if the xpath selector matches multiple elements and `concat` is present, all of the matched elements are concatenated together, using the `concat` value as the separator.
- replace: contains an array of sub-objects. Each sub-object must have `regex` and `with` fields. The `regex` field is the regex pattern to replace, and `with` is the string to replace it with. `$` references capture groups: `$1` is the first capture group, `$2` the second, and so on. Replacements are performed in the order of the array.
- parseDate: if present, the value is the date format using go's reference date (2006-01-02). For example, if an example date was 14-Mar-2003, then the date format would be 02-Jan-2006. See the time.Parse documentation for details. When present, the scraper converts the input string into a date, then converts it to the string format used by stash (YYYY-MM-DD).

Post-processing is done in the order of the fields above: concat, then replace, then parseDate.
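The pipeline described above can be sketched in Go (the language stash is written in). The `postProcess` helper and `replacement` struct below are hypothetical names for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// replacement mirrors one entry of the "replace" array.
type replacement struct {
	regex string
	with  string
}

// postProcess applies the operations in the documented order:
// concat, then replace, then parseDate. Empty options are skipped.
func postProcess(values []string, concat string, replaces []replacement, parseDate string) (string, error) {
	// concat: join all matched elements with the configured separator
	value := strings.Join(values, concat)

	// replace: apply each regex replacement in array order;
	// $1, $2... in "with" reference capture groups
	for _, r := range replaces {
		re, err := regexp.Compile(r.regex)
		if err != nil {
			return "", err
		}
		value = re.ReplaceAllString(value, r.with)
	}

	// parseDate: parse using go's reference-date layout, then
	// output in stash's YYYY-MM-DD format
	if parseDate != "" {
		t, err := time.Parse(parseDate, value)
		if err != nil {
			return "", err
		}
		value = t.Format("2006-01-02")
	}

	return value, nil
}

func main() {
	// two matched elements, joined with " ", comma stripped, then parsed
	out, _ := postProcess([]string{"March 14,", "2003"}, " ",
		[]replacement{{regex: ",", with: ""}}, "January 2 2006")
	fmt.Println(out) // 2003-03-14
}
```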

Below are two example scrapers that will hopefully illustrate these concepts.

This Boobpedia scraper illustrates the concat and parseDate operations:

name: Boobpedia
performerByURL:
  - action: scrapeXPath
    url: 
      - boobpedia.com/boobs/
    scraper: performerScraper

xPathScrapers:
  performerScraper:
    performer:
      Name: //h1
      URL: //table//tr/td//b/a[text()='Official website']/@href
      Twitter: //table//tr/td/b/a[text()='Twitter']/@href
      Instagram: //table//tr/td/b/a[text()='Instagram']/@href
      Birthdate:
        selector: //table//tr/td//b[text()='Born:']/../following-sibling::td/a
        # two elements - concatenate together
        concat: " "
        parseDate: January 2 2006
      Ethnicity: //table//tr/td/b[text()='Ethnicity:']/../following-sibling::td/a
      Country: //table//tr/td/b[text()='Nationality:']/../following-sibling::td/a
      EyeColor: //table//tr/td/b[text()='Eye color:']/../following-sibling::td/a
      Height: //table//tr/td/b[text()='Height:']/../following-sibling::td
      Measurements: //table//tr/td/b[text()='Measurements:']/../following-sibling::td
      FakeTits: //table//tr/td/b[text()='Boobs:']/../following-sibling::td/a
      # nbsp; screws up the parsing, so use contains instead
      CareerLength: //table//tr/td/b[text()[contains(.,'active:')]]/../following-sibling::td
      Aliases: //table//tr/td/b[text()[contains(.,'known')]]/../following-sibling::td

This Pornhub performer scraper illustrates the replace and parseDate operations. I tested it against Mia Malkova's performer page, since it had most of the information filled in:

name: Pornhub
performerByURL:
  - action: scrapeXPath
    url: 
      - pornhub.com
    scraper: performerScraper

xPathScrapers:
  performerScraper:
    common:
      $infoPiece: //div[@class="infoPiece"]/span
    performer:
      Name: //h1[@itemprop="name"]
      Birthdate: 
        selector: //span[@itemprop="birthDate"]
        parseDate: Jan 2, 2006
      Twitter: //span[text() = 'Twitter']/../@href
      Instagram: //span[text() = 'Instagram']/../@href
      Measurements: $infoPiece[text() = 'Measurements:']/../span[@class="smallInfo"]
      Height: 
        selector: $infoPiece[text() = 'Height:']/../span[@class="smallInfo"]
        replace: 
          - regex: .*\((\d+) cm\)
            with: $1
      Ethnicity: $infoPiece[text() = 'Ethnicity:']/../span[@class="smallInfo"]
      FakeTits: $infoPiece[text() = 'Fake Boobs:']/../span[@class="smallInfo"]
      Piercings: $infoPiece[text() = 'Piercings:']/../span[@class="smallInfo"]
      Tattoos: $infoPiece[text() = 'Tattoos:']/../span[@class="smallInfo"]
      CareerLength: 
        selector: $infoPiece[text() = 'Career Start and End:']/../span[@class="smallInfo"]
        replace:
          - regex: \s+to\s+
            with: "-"

WithoutPants changed the title from "Add Xpath post processing" to "Add Xpath post processing and performer name query" on Jan 24, 2020
WithoutPants (Collaborator, Author)

I have added the ability to use the xpath scraper to perform performer name queries from the "Scrape from..." drop down.

performerByName now accepts the scrapeXPath action, and a queryURL field.

The queryURL field is the URL to use to make the query for performer names, where the name input into the dialog is replaced in the URL, replacing the placeholder string sequence {}. For example, a queryURL of http://test.com/query/{} and a performer name of performer X will end up with a query URL of http://test.com/query/performer+X.
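The substitution can be sketched in Go. `buildQueryURL` is a hypothetical helper name, but `net/url`'s `QueryEscape` produces exactly the `performer+X` encoding described above:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildQueryURL substitutes the query-escaped performer name for the
// {} placeholder in the configured queryURL. (Hypothetical helper,
// shown only to illustrate the documented behaviour.)
func buildQueryURL(queryURL, name string) string {
	return strings.ReplaceAll(queryURL, "{}", url.QueryEscape(name))
}

func main() {
	// "performer X" is query-escaped to "performer+X"
	fmt.Println(buildQueryURL("http://test.com/query/{}", "performer X"))
	// http://test.com/query/performer+X
}
```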

When scraping a performer by fragment (i.e. from the Scrape Performer dialog), the server first checks for performerByFragment in the scraper config. If it is not present and the URL on the fragment is set, it tries to scrape via URL, as if the user had entered the URL into the URL field and clicked the scrape button.

To illustrate this, I've extended the Boobpedia scraper config. It should now be a pretty much feature-complete scraper for that site, fulfilling the requirements of #310. Once this PR is merged, I will add this config to the community scrapers repo. See below:

name: Boobpedia
performerByName:
  action: scrapeXPath
  queryURL: http://www.boobpedia.com/wiki/index.php?title=Special%3ASearch&search={}&fulltext=Search
  scraper: performerSearch
performerByURL:
  - action: scrapeXPath
    url: 
      - boobpedia.com/boobs/
    scraper: performerScraper

xPathScrapers:
  performerSearch:
    performer:
      Name: //h2/span[text() = 'Page title matches']/../following-sibling::ul[1]/li//a
      URL: 
        selector: //h2/span[text() = 'Page title matches']/../following-sibling::ul[1]/li//a/@href
        # URL is a partial url, add the first part
        replace:
          - regex: ^
            with: http://www.boobpedia.com
  
  performerScraper:
    performer:
      Name: //h1
      URL: //table//tr/td//b/a[text()='Official website']/@href
      Twitter: //table//tr/td/b/a[text()='Twitter']/@href
      Instagram: //table//tr/td/b/a[text()='Instagram']/@href
      # need to add support for concatenating two elements or something
      Birthdate:
        selector: //table//tr/td//b[text()='Born:']/../following-sibling::td/a
        concat: " "
        # reference date is: 2006/01/02
        parseDate: January 2 2006
      Ethnicity: //table//tr/td/b[text()='Ethnicity:']/../following-sibling::td/a
      Country: //table//tr/td/b[text()='Nationality:']/../following-sibling::td/a
      EyeColor: //table//tr/td/b[text()='Eye color:']/../following-sibling::td/a
      Height: //table//tr/td/b[text()='Height:']/../following-sibling::td
      Measurements: //table//tr/td/b[text()='Measurements:']/../following-sibling::td
      FakeTits: //table//tr/td/b[text()='Boobs:']/../following-sibling::td/a
      # nbsp; screws up the parsing, so use contains instead
      CareerLength: //table//tr/td/b[text()[contains(.,'active:')]]/../following-sibling::td
      Aliases: //table//tr/td/b[text()[contains(.,'known')]]/../following-sibling::td

WithoutPants added the feature label (Pull requests that add a new feature) on Jan 24, 2020
bnkai (Collaborator) commented on Jan 24, 2020

It seems to work ok for me.

A note on the Boobpedia scraper example, not the code:

On some older entries, e.g. Mia Khalifa or Brandi Love, the EyeColor attribute doesn't have an anchor, so EyeColor is not found. Removing the trailing /a from the EyeColor selector matches performers both with and without the link.
The same goes for Country, e.g. for Mia Khalifa: she has dual nationality, so there are two comma-separated anchors in a single td cell that fail to match. Removing the /a captures both Lebanese, American instead of only Lebanese.

A warning for the Birthdate on Brandi Love: WARN[0357] Error parsing date string 'March 29 1973 USA' using format 'January 2 2006': parsing time "March 29 1973 USA": extra text: USA (USA has an anchor also) makes the birthdate null when saving.
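That warning comes straight from Go's time.Parse, which fails when the input contains text beyond what the layout consumes: here the trailing "USA" anchor was concatenated onto the date. A minimal reproduction (`parsesAs` is a hypothetical helper for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// parsesAs reports whether value parses cleanly under the given
// go reference-date layout.
func parsesAs(layout, value string) bool {
	_, err := time.Parse(layout, value)
	return err == nil
}

func main() {
	// the trailing "USA" text exceeds the layout, so parsing fails
	fmt.Println(parsesAs("January 2 2006", "March 29 1973 USA")) // false
	// without the extra text the same layout parses fine
	fmt.Println(parsesAs("January 2 2006", "March 29 1973")) // true
}
```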

It seems to me that the entries on Boobpedia are not that "standardized", so it is natural that we can't match everything at once.

WithoutPants (Collaborator, Author)

I've rebased to resolve conflicts. Please retest.

bnkai (Collaborator) commented on Jan 31, 2020

Looks ok to me.

Leopere merged commit 03c07a4 into stashapp:develop on Jan 31, 2020
ghost pushed a commit to InfiniteStash/stash that referenced this pull request Feb 2, 2020
ghost pushed a commit to InfiniteStash/stash that referenced this pull request Feb 8, 2020
ghost pushed a commit to InfiniteStash/stash that referenced this pull request Mar 1, 2020
WithoutPants deleted the xpath_post_process branch on May 15, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
* Extend xpath configuration. Support concatenation

* Add parseDate parsing option

* Add regex replacements

* Add xpath query performer by name

* Fix loading spinner on scrape performer

* Change ReplaceAll to Replace