allow overriding the used parsers for binary, css, plain text #90

Merged: 1 commit, Jul 9, 2022

Conversation

@brbog brbog commented Jul 9, 2022

Fixes #89.

At the moment I'm using a custom CSS parsing solution based on https://github.com/phax/ph-css. This update would allow me to avoid copying logic from the Parser class.

The reason for the custom CSS parser is that the current regex solution can't cope with URLs like background-image: url('leaves-medium (1920x1280).jpg'). When not crawling binary data, this parsing might be irrelevant, so adding a dependency might not be in everyone's interest. I'll continue this train of thought in a separate comment.

@brbog
Author

brbog commented Jul 9, 2022

Custom CSS parser:
ph-css was chosen because it is actively developed and has been used by Apache JMeter since version 3 (and is still used in the current version 5). It can cope with URLs in a way that regexes can't. For example, for background-image: url('leaves-medium (1920x1280).jpg') it correctly extracts leaves-medium (1920x1280).jpg, while the current regex approach can't tell whether a ')' is part of the content or the end of the url() statement, so the current implementation extracts leaves-medium (1920x1280.
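
To illustrate the failure mode described above, here is a minimal sketch. The NAIVE pattern is a hypothetical stand-in, not the regex actually used in the Parser class, and the QUOTE_AWARE pattern is only an illustration of why the quotes matter; it is not how ph-css works and it does not handle escapes or other parts of the CSS grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CssUrlRegexDemo {
    // Hypothetical naive pattern (an assumption, not the project's regex):
    // the capture group stops at the first ')' or quote it meets.
    static final Pattern NAIVE =
            Pattern.compile("url\\(\\s*['\"]?([^'\")]*)['\"]?\\s*\\)");

    // Illustrative quote-aware pattern: inside quotes, ')' is ordinary content.
    static final Pattern QUOTE_AWARE =
            Pattern.compile("url\\(\\s*(?:'([^']*)'|\"([^\"]*)\"|([^'\")\\s]*))\\s*\\)");

    // Returns the first non-null capture group of the first match, or null.
    static String extract(Pattern pattern, String css) {
        Matcher m = pattern.matcher(css);
        if (!m.find()) {
            return null;
        }
        for (int i = 1; i <= m.groupCount(); i++) {
            if (m.group(i) != null) {
                return m.group(i);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String css = "background-image: url('leaves-medium (1920x1280).jpg')";
        System.out.println(extract(NAIVE, css));       // leaves-medium (1920x1280
        System.out.println(extract(QUOTE_AWARE, css)); // leaves-medium (1920x1280).jpg
    }
}
```

The naive pattern reproduces the truncated result quoted above because it treats the ')' inside the file name as the end of url(). A full CSS parser avoids this class of problem by tokenizing the quoted string first.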

The current implementation could be replaced by it (ph-css doesn't pull in many extra dependencies), but maybe you want to offer the choice of CSS parsing solution through a Service Provider Interface?
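
As a rough sketch of what the SPI idea could look like, using the standard java.util.ServiceLoader: the interface CssUrlExtractor, the class DefaultExtractor, and the method resolve() are hypothetical names invented for this illustration and do not exist in the project.

```java
import java.util.List;
import java.util.ServiceLoader;

class CssParserSpiSketch {
    /** Hypothetical service interface; an assumption for this sketch only. */
    public interface CssUrlExtractor {
        List<String> extractUrls(String css);
    }

    /** Hypothetical stand-in for the built-in (regex-based) behaviour. */
    static final class DefaultExtractor implements CssUrlExtractor {
        @Override
        public List<String> extractUrls(String css) {
            return List.of(); // placeholder: a real default would parse the CSS
        }
    }

    /**
     * Picks the first provider registered on the classpath, falling back to
     * the built-in default when no provider is registered.
     */
    static CssUrlExtractor resolve() {
        return ServiceLoader.load(CssUrlExtractor.class)
                .findFirst()
                .orElseGet(DefaultExtractor::new);
    }

    public static void main(String[] args) {
        System.out.println(resolve().getClass().getSimpleName());
    }
}
```

A provider would be registered by listing its class name in a META-INF/services file named after the interface's binary name, and it needs a public no-argument constructor. With this approach the ph-css dependency could live in an optional module rather than in the core.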

@rzo1
Collaborator

rzo1 commented Jul 9, 2022

Thanks for the PR!

I have no objection to using an additional CSS parser library if it helps to improve the previous home-grown code. I don't know whether the majority of users know how to handle SPI, but it would be a nice solution for exchanging the HTTP or CSS parser. Would you mind creating an issue for it?

@rzo1 rzo1 merged commit 82bea98 into HHN:master Jul 9, 2022
@rzo1 rzo1 added this to the v4.9.2 milestone Jul 9, 2022
@brbog
Author

brbog commented Jul 9, 2022

My bad, I thought the frontiers were also using SPI (I forgot how it was instantiated). There are other ways to get rid of the dependency if people don't want it, so no need to make things more confusing :-).

Development

Successfully merging this pull request may close these issues.

enhancement: allow overriding the used parsers for binary, css, plain text
2 participants