Add Xpath post processing and performer name query #333

Merged: 6 commits into stashapp:develop, Jan 31, 2020

Conversation

WithoutPants (Collaborator)

Note: by necessity, this PR supersedes and extends #332. I can close the other PR, or just rebase this one when it is merged (ideally the latter).

This change adds some post-processing functionality to the xpath scraping configuration.

In an xpath scraper, a field value may now be either a string xpath selector value, or a sub-object.

If it is a sub-object, it must contain the selector field, which has the xpath selector value. Within the sub-object, further fields are available to perform post-processing:

- concat: if the xpath selector matches multiple elements and `concat` is present, all of the matched elements are concatenated together, using the `concat` value as the separator.
- replace: contains an array of sub-objects. Each sub-object must have `regex` and `with` fields. The `regex` field is the regex pattern to replace, and `with` is the string to replace it with. `$` references capture groups: `$1` is the first capture group, `$2` the second, and so on. Replacements are performed in the order of the array.
- parseDate: if present, the value is the date format using go's reference date (2006-01-02). For example, if an example date was 14-Mar-2003, then the date format would be 02-Jan-2006. See the time.Parse documentation for details. When present, the scraper converts the input string into a date, then converts it to the string format used by stash (YYYY-MM-DD).

Post-processing is done in the order of the fields above: concat, then replace, then parseDate.
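The pipeline described above can be sketched in Go (the language stash is written in). The `postProcess` helper and `replacement` struct below are hypothetical names for illustration, not the actual implementation:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

// replacement mirrors one entry of the "replace" array.
type replacement struct {
	regex string
	with  string
}

// postProcess applies the operations in the documented order:
// concat, then replace, then parseDate. Empty options are skipped.
func postProcess(values []string, concat string, replaces []replacement, parseDate string) (string, error) {
	// concat: join all matched elements with the configured separator
	value := strings.Join(values, concat)

	// replace: apply each regex replacement in array order;
	// $1, $2... in "with" reference capture groups
	for _, r := range replaces {
		re, err := regexp.Compile(r.regex)
		if err != nil {
			return "", err
		}
		value = re.ReplaceAllString(value, r.with)
	}

	// parseDate: parse using go's reference-date layout, then
	// output in stash's YYYY-MM-DD format
	if parseDate != "" {
		t, err := time.Parse(parseDate, value)
		if err != nil {
			return "", err
		}
		value = t.Format("2006-01-02")
	}

	return value, nil
}

func main() {
	// two matched elements, joined with " ", comma stripped, then parsed
	out, _ := postProcess([]string{"March 14,", "2003"}, " ",
		[]replacement{{regex: ",", with: ""}}, "January 2 2006")
	fmt.Println(out) // 2003-03-14
}
```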

Below are two example scrapers that will hopefully illustrate these concepts.

This Boobpedia scraper illustrates the concat and parseDate operations:

name: Boobpedia
performerByURL:
  - action: scrapeXPath
    url: 
      - boobpedia.com/boobs/
    scraper: performerScraper

xPathScrapers:
  performerScraper:
    performer:
      Name: //h1
      URL: //table//tr/td//b/a[text()='Official website']/@href
      Twitter: //table//tr/td/b/a[text()='Twitter']/@href
      Instagram: //table//tr/td/b/a[text()='Instagram']/@href
      Birthdate:
        selector: //table//tr/td//b[text()='Born:']/../following-sibling::td/a
        # two elements - concatenate together
        concat: " "
        parseDate: January 2 2006
      Ethnicity: //table//tr/td/b[text()='Ethnicity:']/../following-sibling::td/a
      Country: //table//tr/td/b[text()='Nationality:']/../following-sibling::td/a
      EyeColor: //table//tr/td/b[text()='Eye color:']/../following-sibling::td/a
      Height: //table//tr/td/b[text()='Height:']/../following-sibling::td
      Measurements: //table//tr/td/b[text()='Measurements:']/../following-sibling::td
      FakeTits: //table//tr/td/b[text()='Boobs:']/../following-sibling::td/a
      # nbsp; screws up the parsing, so use contains instead
      CareerLength: //table//tr/td/b[text()[contains(.,'active:')]]/../following-sibling::td
      Aliases: //table//tr/td/b[text()[contains(.,'known')]]/../following-sibling::td

This Pornhub performer scraper illustrates the replace and parseDate operations. I tested it against Mia Malkova's performer page, since it had most of the information filled in:

name: Pornhub
performerByURL:
  - action: scrapeXPath
    url: 
      - pornhub.com
    scraper: performerScraper

xPathScrapers:
  performerScraper:
    common:
      $infoPiece: //div[@class="infoPiece"]/span
    performer:
      Name: //h1[@itemprop="name"]
      Birthdate: 
        selector: //span[@itemprop="birthDate"]
        parseDate: Jan 2, 2006
      Twitter: //span[text() = 'Twitter']/../@href
      Instagram: //span[text() = 'Instagram']/../@href
      Measurements: $infoPiece[text() = 'Measurements:']/../span[@class="smallInfo"]
      Height: 
        selector: $infoPiece[text() = 'Height:']/../span[@class="smallInfo"]
        replace: 
          - regex: .*\((\d+) cm\)
            with: $1
      Ethnicity: $infoPiece[text() = 'Ethnicity:']/../span[@class="smallInfo"]
      FakeTits: $infoPiece[text() = 'Fake Boobs:']/../span[@class="smallInfo"]
      Piercings: $infoPiece[text() = 'Piercings:']/../span[@class="smallInfo"]
      Tattoos: $infoPiece[text() = 'Tattoos:']/../span[@class="smallInfo"]
      CareerLength: 
        selector: $infoPiece[text() = 'Career Start and End:']/../span[@class="smallInfo"]
        replace:
          - regex: \s+to\s+
            with: "-"

WithoutPants changed the title from "Add Xpath post processing" to "Add Xpath post processing and performer name query" on Jan 24, 2020
WithoutPants (Collaborator, Author)

I have added the ability to use the xpath scraper to perform performer name queries from the "Scrape from..." drop down.

performerByName now accepts the scrapeXPath action, and a queryURL field.

The queryURL field is the URL to use to make the query for performer names, where the name input into the dialog is replaced in the URL, replacing the placeholder string sequence {}. For example, a queryURL of http://test.com/query/{} and a performer name of performer X will end up with a query URL of http://test.com/query/performer+X.
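The substitution can be sketched in Go. `buildQueryURL` is a hypothetical helper name, but `net/url`'s `QueryEscape` produces exactly the `performer+X` encoding described above:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// buildQueryURL substitutes the query-escaped performer name for the
// {} placeholder in the configured queryURL. (Hypothetical helper,
// shown only to illustrate the documented behaviour.)
func buildQueryURL(queryURL, name string) string {
	return strings.ReplaceAll(queryURL, "{}", url.QueryEscape(name))
}

func main() {
	// "performer X" is query-escaped to "performer+X"
	fmt.Println(buildQueryURL("http://test.com/query/{}", "performer X"))
	// http://test.com/query/performer+X
}
```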

When scraping a performer by fragment (i.e. from the Scrape Performer dialog), the server first checks for performerByFragment in the scraper config. If it is not present and the URL on the fragment is set, it tries to scrape via URL, as if the user had entered the URL into the URL field and clicked the scrape button.

To illustrate this, I've extended the Boobpedia scraper config. It should now be a pretty much feature-complete scraper for that site, fulfilling the requirements of #310. Once this PR is merged, I will add this config to the community scrapers repo. See below:

name: Boobpedia
performerByName:
  action: scrapeXPath
  queryURL: http://www.boobpedia.com/wiki/index.php?title=Special%3ASearch&search={}&fulltext=Search
  scraper: performerSearch
performerByURL:
  - action: scrapeXPath
    url: 
      - boobpedia.com/boobs/
    scraper: performerScraper

xPathScrapers:
  performerSearch:
    performer:
      Name: //h2/span[text() = 'Page title matches']/../following-sibling::ul[1]/li//a
      URL: 
        selector: //h2/span[text() = 'Page title matches']/../following-sibling::ul[1]/li//a/@href
        # URL is a partial url, add the first part
        replace:
          - regex: ^
            with: http://www.boobpedia.com
  
  performerScraper:
    performer:
      Name: //h1
      URL: //table//tr/td//b/a[text()='Official website']/@href
      Twitter: //table//tr/td/b/a[text()='Twitter']/@href
      Instagram: //table//tr/td/b/a[text()='Instagram']/@href
      # need to add support for concatenating two elements or something
      Birthdate:
        selector: //table//tr/td//b[text()='Born:']/../following-sibling::td/a
        concat: " "
        # reference date is: 2006/01/02
        parseDate: January 2 2006
      Ethnicity: //table//tr/td/b[text()='Ethnicity:']/../following-sibling::td/a
      Country: //table//tr/td/b[text()='Nationality:']/../following-sibling::td/a
      EyeColor: //table//tr/td/b[text()='Eye color:']/../following-sibling::td/a
      Height: //table//tr/td/b[text()='Height:']/../following-sibling::td
      Measurements: //table//tr/td/b[text()='Measurements:']/../following-sibling::td
      FakeTits: //table//tr/td/b[text()='Boobs:']/../following-sibling::td/a
      # nbsp; screws up the parsing, so use contains instead
      CareerLength: //table//tr/td/b[text()[contains(.,'active:')]]/../following-sibling::td
      Aliases: //table//tr/td/b[text()[contains(.,'known')]]/../following-sibling::td

WithoutPants added the feature label (Pull requests that add a new feature) on Jan 24, 2020
bnkai (Collaborator) commented on Jan 24, 2020

It seems to work ok for me.

A note on the Boobpedia scraper example, not the code:

On some older entries, e.g. Mia Khalifa or Brandi Love, the EyeColor attribute doesn't have an anchor, so EyeColor is not found. Removing the trailing /a from the EyeColor selector matches performers both with and without the link.
The same goes for Country, e.g. for Mia Khalifa: she has dual nationality, so there are two comma-separated anchors in a single td cell that fail to match. Removing the /a captures both Lebanese, American instead of only Lebanese.

A warning for the Birthdate on Brandi Love: WARN[0357] Error parsing date string 'March 29 1973 USA' using format 'January 2 2006': parsing time "March 29 1973 USA": extra text: USA (USA has an anchor also) makes the birthdate null when saving.
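That warning comes straight from Go's time.Parse, which fails when the input contains text beyond what the layout consumes: here the trailing "USA" anchor was concatenated onto the date. A minimal reproduction (`parsesAs` is a hypothetical helper for illustration):

```go
package main

import (
	"fmt"
	"time"
)

// parsesAs reports whether value parses cleanly under the given
// go reference-date layout.
func parsesAs(layout, value string) bool {
	_, err := time.Parse(layout, value)
	return err == nil
}

func main() {
	// the trailing "USA" text exceeds the layout, so parsing fails
	fmt.Println(parsesAs("January 2 2006", "March 29 1973 USA")) // false
	// without the extra text the same layout parses fine
	fmt.Println(parsesAs("January 2 2006", "March 29 1973")) // true
}
```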

It seems to me that the entries on Boobpedia are not that "standardized", so it is natural that we can't match everything at once.

WithoutPants (Collaborator, Author)

I've rebased to resolve conflicts. Please retest.

bnkai (Collaborator) commented on Jan 31, 2020

Looks ok to me.

Leopere merged commit 03c07a4 into stashapp:develop on Jan 31, 2020
ghost pushed a commit to InfiniteStash/stash that referenced this pull request Feb 2, 2020
ghost pushed a commit to InfiniteStash/stash that referenced this pull request Feb 8, 2020
ghost pushed a commit to InfiniteStash/stash that referenced this pull request Mar 1, 2020
WithoutPants deleted the xpath_post_process branch on May 15, 2020
Tweeticoats pushed a commit to Tweeticoats/stash that referenced this pull request Feb 1, 2021
* Extend xpath configuration. Support concatenation

* Add parseDate parsing option

* Add regex replacements

* Add xpath query performer by name

* Fix loading spinner on scrape performer

* Change ReplaceAll to Replace