Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add an "excludes" field to page selectors #239

Closed
kevinschaul opened this issue Feb 6, 2019 · 1 comment
Closed

Proposal: Add an "excludes" field to page selectors #239

kevinschaul opened this issue Feb 6, 2019 · 1 comment

Comments

@kevinschaul
Copy link
Contributor

Hi Klaxon team,

First of all, thanks for open sourcing such an incredible tool. I've been experimenting with Klaxon to track changes in 2020 candidates's websites, and it has been super simple to use. Stellar work.

I am running into an issue, though, possibly because of the way I'm trying to use Klaxon. Because I'm interested in almost all changes in a candidate's website, I've been using body as my selector. That works great for some sites but picks up a lot of junk for others.

The website Tim Ryan For Congress includes a section of his latest Twitter posts. I really don't care about those for the purposes of this (and these changes are filling up my free Heroku database fast). I think I can get around this limitation by using a :not() in my selector for this particular site. Luckily all content appears in a header or section element, and all these elements are siblings of each other, so I should be able to write a simple CSS selector to exclude the Tweets. But if the site were more complicated, I fear I'd run into some nokogiri limitations on its CSS selector logic that don't seem likely to be resolved until nokogiri v2.0.0.

I'd like to propose adding some sort of exclude option for pages. I am envisioning the option as a CSS selector (or selectors). Any changes that occur within the excludes selectors would be ignored. In my Tim Ryan example above, my selector could remain body but I could exclude #feed-section.

What do you all think of an excludes option?

I am happy to try implementing this feature (although my Ruby skills are roughly nonexistent). I think it could be done fairly surgically by adjusting (Page.match_text)[https://github.com/themarshallproject/klaxon/blob/develop/app/models/page.rb#L39-L41] to do the following:

  1. Match the document CSS selector
  2. Within that fragment, match the excludes CSS selector. For each of these matches, set the text to an empty string.
  3. Use nokogiri's .text method to return the text of everything else

Please let me know if you have any thoughts or considerations. And again, props to releasing such a great project.

@GabeIsman
Copy link
Member

Hey Kevin,

Thanks for the kind words. This seems like a great idea to me, and I think you've identified a very clean implementation plan.

I think the one thing left to consider is where and how users will edit these excludes selectors. I think it would be too much/overcomplicated to add this feature to the bookmarklet. It should be a sort of "advanced" feature for power users. So I think you could just get away with adding a field to the Page edit page.

Note that you shouldn't need multiple exclude selectors, since you can just use a disjoint selector to represent this case (e.g. #feed-section, #updates-section).

Let me know if you want any guidance on navigating rails, and thanks for contributing!

kevinschaul added a commit to kevinschaul/klaxon that referenced this issue Feb 23, 2019
kevinschaul added a commit to kevinschaul/klaxon that referenced this issue Feb 23, 2019
Add exclude_selector to relevant views

See themarshallproject#239
kevinschaul added a commit to kevinschaul/klaxon that referenced this issue Feb 24, 2019
- Add exclude_selector to Page
- Adjust Page.match_text and PageSnapshot.match_text to use exclude_selector
- Expose exclude_selector in relevant views
- Tests
- Documentation

Fixes themarshallproject#239
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants