xpath, improvement: add another xpath for image for Hustler #1370
This works for the image as intended.
However, most of the selectors do not work when scanning with the test links above: Performers, Date, and Details are not parsed.
I've tweaked the XPath selectors for Title, Date, Performers, and Details.
I think something about the pages for older scenes with descriptions means that Details can't be scraped in the same way. In any case, the images work, and I've now fixed all of the issues for new scenes and all but one (Details) of the issues for older scenes.
Indeed, it looks like the only way to get the Details for these old URLs would be to have a sub-scraper call the JSON API. That would probably mean migrating to Python or Ruby, so I think this is the best we can do with the current XPath scraper.
postProcess:
- parseDate: Jan 02, 2006
Details: //meta[@property="og:description"]/@content|//div[@class="description"]/p
Image: //div[@class="img-container"]/img/@src
Details: //p[following-sibling::a[@class="clickable"]]|//meta[@property="og:description"]/@content|//div[@class="description"]/p
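For anyone checking the date format outside the scraper: the `parseDate` value above looks like a Go reference-time layout (abbreviated month, two-digit day, four-digit year). A minimal Python sketch of the same parse, assuming `%b %d, %Y` is the matching `strptime` format; the function name and sample dates are illustrative only and not taken from the site:

```python
from datetime import datetime

# Layout string as it appears in the scraper's postProcess step.
GO_LAYOUT = "Jan 02, 2006"
# Assumed strptime equivalent: abbreviated month, day of month, 4-digit year.
PY_FORMAT = "%b %d, %Y"


def parse_scene_date(raw: str) -> str:
    """Parse a date such as 'Mar 05, 2021' and return it as YYYY-MM-DD."""
    return datetime.strptime(raw.strip(), PY_FORMAT).date().isoformat()


if __name__ == "__main__":
    # Hypothetical sample value, not copied from a real scene page.
    print(parse_scene_date("Mar 05, 2021"))
```

If the date text on a page uses a different shape (full month name, no comma), this parse raises `ValueError`, which is a quick way to spot a layout mismatch.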
Details: //p[following-sibling::a[@class="clickable"]]|//meta[@property="og:description"]/@content|//div[@class="description"]/p
Details: //div[@class="panel-content"]/div/div/text()|//meta[@property="og:description"]/@content|//div[@class="description"]/p
This gets part of the Details for older scenes.
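A rough way to see which branch of a `|` selector fires on which page layout is to test the alternatives one by one. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath (no `|`, no `text()` step), so each alternative is queried separately. Both page fragments are hypothetical, simplified stand-ins; the real site markup is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragments; not copied from the real pages.
NEW_PAGE = """<html><head><meta property="og:description" content="Short teaser"/></head>
<body><div class="description"><p>Full description text.</p></div></body></html>"""

OLD_PAGE = """<html><head></head><body>
<div class="panel-content"><div><div>Part of the older description</div></div></div>
</body></html>"""

# The three alternatives from the suggested selector, in the same order.
# Each entry is (ElementTree path, where the value lives on the matched node).
SELECTORS = [
    (".//div[@class='panel-content']/div/div", "text"),
    (".//meta[@property='og:description']", "content"),
    (".//div[@class='description']/p", "text"),
]


def matches(page: str):
    """Return (path, value) for every alternative that yields non-empty text."""
    root = ET.fromstring(page)
    found = []
    for path, source in SELECTORS:
        for node in root.findall(path):
            value = node.text if source == "text" else node.get(source)
            if value and value.strip():
                found.append((path, value.strip()))
    return found


if __name__ == "__main__":
    for name, page in (("new", NEW_PAGE), ("old", OLD_PAGE)):
        print(name, matches(page))
```

On the made-up new-style fragment two alternatives match at once (the `og:description` teaser and the description paragraph), while the old-style fragment matches only the `panel-content` branch. A union returns every node matched by any branch, so a page that satisfies more than one branch can yield both the short and the full text.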
Hmm, when I try this with the new scenes, it also gets only the first part (the text shown before clicking Read More) rather than the full description. I will see if I can come up with a selector combination that gets the full description for all scenes.
This adds another xpath selector for the existing Hustler.yaml scraper so that it can get the cover image for scene pages, e.g.
I left the existing selector in case there are different styles/layouts for scene pages where the existing selector will continue to work.